OCMA: Fast, Memory-Efficient Factorization of Prohibitively Large Relationship Matrices

Matrices representing genetic relatedness among individuals (i.e., Genomic Relationship Matrices, GRMs) play a central role in genetic analysis. The eigen-decomposition of a GRM (or its alternative, a Singular Value Decomposition of the genotype matrix that yields only the top singular values) is a necessary step in many analyses, including estimation of SNP-heritability, Principal Component Analysis (PCA), and genomic prediction. However, the GRMs and genotype matrices provided by modern biobanks are too large to be stored in active memory. To accommodate current and future "bigger data", we develop a disk-based tool, Out-of-Core Matrices Analyzer (OCMA), that performs eigen-decomposition and Singular Value Decomposition (SVD) on matrices that do not fit in memory. By integrating memory mapping (mmap) with the latest matrix factorization libraries, the tool is fast and memory-efficient. To demonstrate its performance, we test OCMA on a personal computer. For full eigen-decomposition, it factorizes an ordinary GRM (N = 10,000) in 55 sec. For SVD, a commonly used faster alternative to full eigen-decomposition in genomic analyses, OCMA computes the top 200 singular values (SVs) in half an hour, the top 2,000 SVs in 0.95 hr, and all 5,000 SVs in 1.77 hr for a very large genotype matrix (N = 1,000,000, M = 5,000) on the same personal computer. OCMA also supports multi-threading when running on a desktop or an HPC cluster. OCMA can thus alleviate the computing bottleneck of classical analyses on large genomic matrices and make it possible to scale up current and emerging analytical methods to big genomics data using lightweight computing resources.
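The general idea of combining memory mapping with a dense factorization routine can be illustrated with a minimal Python sketch. This is not OCMA's implementation: the function name `top_singular_values`, the raw binary file layout (row-major float64), the block size, and the choice to accumulate the M x M Gram matrix are all assumptions made for illustration. The sketch relies on the genotype matrix being tall (N much larger than M), so only one block of rows plus an M x M matrix needs to be resident in memory at any time.

```python
import os
import tempfile
import numpy as np

def top_singular_values(path, n_rows, n_cols, k, block_rows=1024):
    """Top-k singular values of an on-disk n_rows x n_cols float64 matrix.

    Hypothetical helper: assumes a raw row-major float64 file and n_rows >> n_cols.
    """
    # Memory-map the file; the OS pages data in from disk on demand.
    X = np.memmap(path, dtype=np.float64, mode="r", shape=(n_rows, n_cols))
    # Accumulate the small Gram matrix X^T X one block of rows at a time.
    gram = np.zeros((n_cols, n_cols), dtype=np.float64)
    for start in range(0, n_rows, block_rows):
        block = np.asarray(X[start:start + block_rows])
        gram += block.T @ block
    # Eigenvalues of X^T X are the squared singular values of X (ascending order).
    eigvals = np.linalg.eigvalsh(gram)
    top = eigvals[::-1][:k]
    # Clip tiny negative values caused by rounding before taking the square root.
    return np.sqrt(np.clip(top, 0.0, None))

def demo():
    """Write a small random matrix to disk and compare with an in-memory SVD."""
    rng = np.random.default_rng(0)
    n_rows, n_cols, k = 2000, 50, 5
    data = rng.standard_normal((n_rows, n_cols))
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "genotypes.bin")
        data.tofile(path)  # raw row-major float64
        out = top_singular_values(path, n_rows, n_cols, k, block_rows=256)
    ref = np.linalg.svd(data, compute_uv=False)[:k]
    return out, ref
```

Forming the Gram matrix squares the condition number, so the smallest singular values lose accuracy; a production tool would more likely use a blocked or randomized SVD routine from an optimized library.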
