Technical documentation
Using Fiuncho as a library
If you are a developer and want to use any of the C++ classes and methods
implemented in this program, you can do so by linking to the main library
libfiuncho declared in the CMake project.
The following example shows a short program that makes use of Fiuncho’s methods, and its corresponding CMake project configuration. Assume the following project structure:
myproject/
├── build/
├── fiuncho/
├── CMakeLists.txt
└── main.cpp
The directory fiuncho/ contains the source code of Fiuncho and build/ is
the directory where we will build the project from its sources. The program
main.cpp simply reads a data set using the Dataset class
provided in Fiuncho, and prints the number of SNPs read:
1#include <fiuncho/Dataset.h>
2#include <iostream>
3
4int main(int argc, char **argv)
5{
6 // Arguments to the program are the tped and tfam input files
7 if (argc < 3){
8 return 1;
9 }
10 const std::string tped(argv[1]);
11 const std::string tfam(argv[2]);
12 // Read data
13 const auto dataset = Dataset<uint64_t>::read({tped, tfam});
14 // Print the number of SNPs read
15 std::cout << dataset.snps << std::endl;
16 // Exit
17 return 0;
18}
The CMakeLists.txt file will declare the project, include fiuncho as a CMake
subdirectory, declare the main executable target and link it to libfiuncho:
1cmake_minimum_required(VERSION 3.11)
2project(Myproject VERSION 1 LANGUAGES CXX)
3set(CMAKE_CXX_STANDARD 14)
4# Add fiuncho project as a subdirectory
5add_subdirectory(fiuncho)
6# Add our main executable
7add_executable(main main.cpp)
8# Link it against fiuncho's library
9target_link_libraries(main libfiuncho)
libfiuncho’s build can be configured using the same CMake variables defined
in the installation-and-usage:Advanced configuration subsection.
Classes documentation
Dataset class
-
template<class T>
class Dataset Class representing a collection of N GenotypeTable’s, each containing a single SNP, for M individuals.
- tparam T
data type used to represent the individual information in the GenotypeTable’s
Factory methods
-
static inline Dataset<T> read(const std::vector<std::string> &inputs)
Read input data and store it using a GenotypeTable representation. The underlying arrays used in the different tables are allocated contiguously in memory.
- Parameters
inputs – List of files to read (in any order). Supported files are:
tped+tfamfilesrawfiles
- Returns
A Dataset object
Methods
-
inline GenotypeTable<T> &operator[](int i)
Access the bit table vector.
- Parameters
i – Index of the table in the vector
- Returns
A reference to the GenotypeTable object
Attributes
-
const size_t cases
Number of individuals in the case group of the data set
-
const size_t ctrls
Number of individuals in the control group of the data set
-
const size_t snps
Number of SNPs in the data set
-
GenotypeTable<T> *data
Vector holding all the GenotypeTable’s representing the individual SNP’s information
GenotypeTable class
-
template<class T>
class GenotypeTable Class representing the genotypes of M individuals using a binary encoding. The size of the table is determined by the number of SNPs S represented in combination: \( size = 3^S \).
The genotype information of the individuals is stored in two subtables, one for the individuals from the cases group, and another for the individuals from the controls group.
- tparam T
data type used to represent the individuals information in the table
Constructors
-
inline GenotypeTable(const short order, const size_t cases_words, const size_t ctrls_words)
Create a new uninitialized table, allocating an array with enough space to hold \( 3^{order} (cases\_words + ctrls\_words) \) values of type T.
- Parameters
order – Number of SNPs represented in combination inside the table
cases_words – Number of values of type T required to represent the genotypes for all individuals in the case group
ctrls_words – Number of values of type T required to represent the genotypes for all individuals in the control group
Methods
-
static void combine(const GenotypeTable<T> &t1, const GenotypeTable<T> &t2, GenotypeTable<T> &out) noexcept
Combine the SNPs represented in tables t1 and t2 into a single table out.
- Parameters
t1 – Reference to a GenotypeTable representing any number of SNPs in combination
t2 – Reference to a GenotypeTable representing a single SNP
out – Reference to a GenotypeTable where the combination of t1 and t2 will be stored
-
template<class U>
static void combine_and_popcnt(const GenotypeTable<T> &t1, const GenotypeTable<T> &t2, ContingencyTable<U> &out) noexcept Combine the SNPs represented in tables t1 and t2, and store the genotype frequencies in the ContingencyTable out.
- Parameters
t1 – Reference to a GenotypeTable representing any number of SNPs in combination
t2 – Reference to a GenotypeTable representing a single SNP
out – Reference to a ContingencyTable where the genotype frequencies of the combination of t1 and t2 will be stored
Attributes
-
const size_t order
Number of SNPs represented in the table
-
const size_t size
Number of rows contained in each subtable
-
const size_t cases_words
Number of values contained in each row of the cases subtable
-
const size_t ctrls_words
Number of values contained in each row of the controls subtable
Function GenotypeTable::combine_and_popcnt() has multiple
implementations:
File
src/avx512vpopcntdq/gt_popcnt.cpp:Functions
-
inline void avx512vpopcntdq_native_popcnt(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function relies on the
_mm512_popcnt_epi64AVX intrinsic introduced with theAVX512VPOPCNTDQextension to perform the popcount operation.
-
inline void avx512vpopcntdq_native_popcnt(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
File
src/avx512bw/gt_popcnt_avx512bw_hs.cpp:Functions
-
__m512i popcount(const __m512i v)
-
void CSA(__m512i &h, __m512i &l, __m512i a, __m512i b, __m512i c)
-
inline void avx512bw_popcnt_harley_seal(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function is a fork of the function
popcnt_AVX512_harley_sealfrom https://github.com/WojciechMula/sse-popcount/tree/6feb3dba32c526b17de01e931c116900e3a23104 modified to interleave the genotype table combination operations with thepopcountoperations.
-
__m512i popcount(const __m512i v)
File
src/avx512bw/gt_popcnt_avx512bw_lu.cpp:Functions
-
inline void avx512bw_popcnt_lookup(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function is a fork of the function
popcnt_AVX512BW_lookup_originalfrom https://github.com/WojciechMula/sse-popcount/tree/6feb3dba32c526b17de01e931c116900e3a23104 modified to interleave the genotype table combination operations with thepopcountoperations.
-
inline void avx512bw_popcnt_lookup(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
File
src/avx512bw/gt_popcnt_avx2_cpu.cpp:Functions
-
inline void iter(const __m256i &val, uint32_t &result)
-
inline void avx512bw_popcnt_256_cpu(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function is a fork of the function
popcnt_AVX2_and_cpufrom https://github.com/WojciechMula/sse-popcount/tree/6feb3dba32c526b17de01e931c116900e3a23104 modified to interleave the genotype table combination operations with thepopcountoperations.
-
inline void iter(const __m256i &val, uint32_t &result)
File
src/avx512bw/gt_popcnt_avx2_hs.cpp:Functions
-
__m256i popcount(const __m256i v)
-
void CSA(__m256i &h, __m256i &l, __m256i a, __m256i b, __m256i c)
-
inline void avx512bw_popcnt_256_harley_seal(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function is a fork of the function
popcnt_AVX2_harley_sealfrom https://github.com/WojciechMula/sse-popcount/tree/6feb3dba32c526b17de01e931c116900e3a23104 modified to interleave the genotype table combination operations with thepopcountoperations.
-
__m256i popcount(const __m256i v)
File
src/avx512bw/gt_popcnt_avx2_lu.cpp:Functions
-
inline void iter(const __m256i val, const __m256i &lookup, const __m256i &low_mask, __m256i &local)
-
inline void avx512bw_popcnt_256_lookup(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function is a fork of the function
popcnt_AVX2_lookupfrom https://github.com/WojciechMula/sse-popcount/tree/6feb3dba32c526b17de01e931c116900e3a23104 modified to interleave the genotype table combination operations with thepopcountoperations.
-
inline void iter(const __m256i val, const __m256i &lookup, const __m256i &low_mask, __m256i &local)
File
src/avx512bw/gt_popcnt_avx2_lu_orig.cpp:Functions
-
inline void avx512bw_popcnt_256_lookup_original(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function is a fork of the function
popcnt_AVX2_lookup_originalfrom https://github.com/WojciechMula/sse-popcount/tree/6feb3dba32c526b17de01e931c116900e3a23104 modified to interleave the genotype table combination operations with thepopcountoperations.
-
inline void avx512bw_popcnt_256_lookup_original(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
File
src/avx512bw/gt_popcnt_native_movdq.cpp:Functions
-
inline void avx512bw_popcnt_64_native_movdq(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function is a fork of the function
builtin_popcnt_movdqfrom https://github.com/WojciechMula/sse-popcount/tree/6feb3dba32c526b17de01e931c116900e3a23104 modified to interleave the genotype table combination operations with thepopcountoperations.
-
inline void avx512bw_popcnt_64_native_movdq(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
File
src/avx512bw/gt_popcnt_native_unrolled_errata.cpp:Functions
-
inline void avx512bw_popcnt_64_native_unrolled_errata(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function is a fork of the function
builtin_popcnt_unrolled_erratafrom https://github.com/WojciechMula/sse-popcount/tree/6feb3dba32c526b17de01e931c116900e3a23104 modified to interleave the genotype table combination operations with thepopcountoperations.
-
inline void avx512bw_popcnt_64_native_unrolled_errata(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
File
src/avx2/gt_popcnt_avx2_cpu.cpp:Functions
-
inline void iter(const uint64_t *ptr1, const uint64_t *ptr2, uint32_t &result)
-
inline void avx2_popcnt_cpu(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function is a fork of the function
popcnt_AVX2_and_cpufrom https://github.com/WojciechMula/sse-popcount/tree/6feb3dba32c526b17de01e931c116900e3a23104 modified to interleave the genotype table combination operations with thepopcountoperations.
-
inline void iter(const uint64_t *ptr1, const uint64_t *ptr2, uint32_t &result)
File
src/avx2/gt_popcnt_avx2_hs.cpp:Functions
-
__m256i popcount(const __m256i v)
-
void CSA(__m256i &h, __m256i &l, __m256i a, __m256i b, __m256i c)
-
inline void avx2_popcnt_harley_seal(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function is a fork of the function
popcnt_AVX2_harley_sealfrom https://github.com/WojciechMula/sse-popcount/tree/6feb3dba32c526b17de01e931c116900e3a23104 modified to interleave the genotype table combination operations with thepopcountoperations.
-
__m256i popcount(const __m256i v)
File
src/avx2/gt_popcnt_avx2_lu.cpp:Functions
-
inline void iter(const uint64_t *ptr1, const uint64_t *ptr2, const __m256i &lookup, const __m256i &low_mask, __m256i &local)
-
inline void avx2_popcnt_lookup(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function is a fork of the function
popcnt_AVX2_lookupfrom https://github.com/WojciechMula/sse-popcount/tree/6feb3dba32c526b17de01e931c116900e3a23104 modified to interleave the genotype table combination operations with thepopcountoperations.
-
inline void iter(const uint64_t *ptr1, const uint64_t *ptr2, const __m256i &lookup, const __m256i &low_mask, __m256i &local)
File
src/avx2/gt_popcnt_avx2_lu_orig.cpp:Functions
-
inline void avx2_popcnt_lookup_original(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function is a fork of the function
popcnt_AVX2_lookup_originalfrom https://github.com/WojciechMula/sse-popcount/tree/6feb3dba32c526b17de01e931c116900e3a23104 modified to interleave the genotype table combination operations with thepopcountoperations.
-
inline void avx2_popcnt_lookup_original(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
File
src/avx2/gt_popcnt_native_movdq.cpp:Functions
-
inline void avx2_popcnt_64_native_movdq(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function is a fork of the function
builtin_popcnt_movdqfrom https://github.com/WojciechMula/sse-popcount/tree/6feb3dba32c526b17de01e931c116900e3a23104 modified to interleave the genotype table combination operations with thepopcountoperations.
-
inline void avx2_popcnt_64_native_movdq(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
File
src/avx2/gt_popcnt_native_unrolled_errata.cpp:Functions
-
inline void avx2_popcnt_64_native_unrolled_errata(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
Implements the combination of two genotype subtables and subsequent contingency table computation. The function is a fork of the function
builtin_popcnt_unrolled_erratafrom https://github.com/WojciechMula/sse-popcount/tree/6feb3dba32c526b17de01e931c116900e3a23104 modified to interleave the genotype table combination operations with thepopcountoperations.
-
inline void avx2_popcnt_64_native_unrolled_errata(const uint64_t *gt_tbl1, const size_t gt_size, const uint64_t *gt_tbl2, const size_t words, uint32_t *ct_tbl, const size_t ct_size)
ContingencyTable class
-
template<class T>
class ContingencyTable Class representing the genotypes frequencies of the combination of S SNPs for all individuals. The size of the table is: \( size = 3^S \).
The genotype frequencies are segregated by the class of the individuals (cases or controls) in two subtables.
- tparam T
data type used to represent the frequencies in the table
Constructors
-
inline ContingencyTable(const short order)
Create a new uninitialized table, allocating an array with enough space to hold \( 3^{order} \) values of type T.
- Parameters
order – Number of SNPs represented in combination inside the table
MutualInformation class
-
template<class T>
class MutualInformation : public Algorithm<T> Class implementing the Mutual Information (MI) computation. Taking the genotype and phenotype variability as two random variables \(X\) and \(Y\), MI can be obtained as:
\[\begin{equation} MI(X; Y) = H(X) + H(Y) - H(X,Y) \end{equation}\]Marginal entropies of a single and two variables are defined as:
\[\begin{equation} H(X) = - \sum_{x \in X} p(x)\log p(x) \end{equation}\]\[\begin{equation} H(X,Y) = - \sum_{x, y} p(x, y)\log p(x, y) \end{equation}\]These probabilities can be obtained directly from a ContingencyTable as the division between the number of occurrences and the number of total observations.
- tparam T
data type used to represent the MI values
Constructors
-
inline MutualInformation(unsigned int num_cases, unsigned int num_ctrls)
Create a MutualInformation instance that allows to compute the MI value of a particular ContingencyTable for a fixed number of cases and controls. The number of cases and controls is fixed so that multiple calculations of the MI for the same data set do not require to compute \(H(Y)\) repeatedly.
- Parameters
num_cases – Path to the tped data file
num_ctrls – Path to the tfam data file
Methods
-
template<class U>
T compute(const ContingencyTable<U> &table) const noexcept Compute the MI of a particular ContingencyTable.
- Parameters
table – ContingencyTable from which the MI value is to be calculated
- Template Parameters
U – Data type used in the input ContingencyTable’s to represent the count of individuals
- Returns
The MI value
Distribution class
-
template<typename T>
class Distribution Class that represents the distribution of the workload among all the compute units available. The distribution creates an assignment of all k-combinations without repetition from a set of n SNPs following a Round-robin distribution. To do this, combinations are enumerated using a particular step size and an initial offset.
- tparam T
Data type used to index the SNPs in the data set. Needs to be a signed type
Attributes
Constructors
-
inline Distribution(const T &n, const T &k, const T &step, const T &offset)
Create a new distribution.
- Parameters
n – Number of SNPs in the set
k – Size of the combinations to consider
step – Number of combinations to advance between iterations
offset – Number of combinations to advance before starting the iteration
Methods
-
inline Distribution<T> layer(const T step, const T offset) const
Create a new distribution from an existing one by layering a secondary step and offset. The new distribution will conserve the same number of n SNPs in the set and combination size k. The new step is calculated as \( step_{1} * step_{2} \), and the new offset is calculated as \( offset_{1} * step_{2} + offset_{2} \).
- Parameters
step – Secondary step to include in the new distribution
offset – Secondary offset to include in the new distribution
-
inline const_iterator begin() const
Returns an iterator to the first combination of the distribution. Combinations returned reuse the same std::vector object, succesively updating its content.
- Returns
Iterator to the first combination
-
inline const_iterator end() const
Returns an iterator to the combination following the last combination of the distribution.
- Returns
Iterator to the combination following the last combination.
-
class const_iterator
MPIEngine class
-
class MPIEngine
Epistasis search engine, implementing a distributed algorithm using MPI. Each node available in the MPI context is used to explore SNP combinations in parallel.
Constructors
Methods
-
template<typename T, typename ...Args>
inline std::vector<Result<int, float>> run(const std::vector<std::string> &inputs, const unsigned int order, const unsigned int outputs, Args&&... args) Run the epistasis search on the different MPI processes. Each process will, in turn, call Search::run to exploit the resources available to that process. The returned vector will only be available to process 0.
- Parameters
tped – Path to the tped data file
tfam – Path to the tfam data file
order – Order of the epistatic interactions to locate
outputs – Number of results to include in the output vector
args – Arguments to the Search class
- Template Parameters
- Returns
Vector of Result’s sorted in descending order by their MutualInformation value
-
template<typename T, typename ...Args>
Search class
-
class Search
Epistasis search class interface. Classes implementing the Search interface make use of the hardware resources available to a particular MPI process to examine the fraction of the SNP combinations assigned to that process.
Subclassed by ThreadedSearch
Methods
-
virtual std::vector<Result<int, float>> run(const Dataset<uint64_t> &dataset, const unsigned short order, const Distribution<int> &distribution, const unsigned int outputs) = 0
Run the epistasis search for this MPI process
- Parameters
dataset – Dataset from which the SNPs are read
order – Size of the combinations to explore
distribution – Distribution object for the particular combination distribution of this MPI process
outputs – Number of results to include in the output vector
- Returns
Vector of Result’s
-
virtual std::vector<Result<int, float>> run(const Dataset<uint64_t> &dataset, const unsigned short order, const Distribution<int> &distribution, const unsigned int outputs) = 0
ThreadedSearch class
-
class ThreadedSearch : public Search
Epistasis search class that uses CPU multi-threading to complete the search.
Constructors
-
inline ThreadedSearch(unsigned int threads)
Create a ThreadedSearch object.
- Parameters
threads – Number of threads to use during the search
Methods
-
inline virtual std::vector<Result<int, float>> run(const Dataset<uint64_t> &dataset, const unsigned short order, const Distribution<int> &distribution, const unsigned int outputs)
Run the epistasis search for this MPI process
- Parameters
dataset – Dataset from which the SNPs are read
order – Size of the combinations to explore
distribution – Distribution object for the particular combination distribution of this MPI process
outputs – Number of results to include in the output vector
- Returns
Vector of Result’s
-
inline ThreadedSearch(unsigned int threads)