Sci-LMM is an efficient strategy for inferring genetic variance components using population scale family trees

The rapid digitization of genealogical and medical records enables the assembly of extremely large pedigrees spanning millions of individuals and trillions of pairs of relatives. Such pedigrees provide the opportunity to answer genetic and epidemiological questions at scales much larger than previously possible. Linear mixed models (LMMs) are often used to analyze pedigree data, but traditional LMMs do not scale well because of their steep computational and storage requirements. Here, we propose a novel modeling framework called Sparse Cholesky factorIzation LMM (Sci-LMM) that alleviates these difficulties by exploiting the sparsity patterns found in population-scale family trees. The proposed framework constructs a matrix of genetic relationships between trillions of pairs of individuals in several hours and can fit the corresponding LMM in several days. We demonstrate the capabilities of Sci-LMM via simulation studies and by estimating the heritability of longevity in a very large pedigree spanning millions of individuals and over five centuries of human history. The Sci-LMM framework enables analyses of extremely large pedigrees that were not previously possible. Sci-LMM is available at https://github.com/TalShor/SciLMM.
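To illustrate why sparsity helps when building a matrix of genetic relationships, the following is a minimal sketch of the classical tabular recursion for the additive (numerator) relationship matrix, storing only nonzero entries. It is not the Sci-LMM implementation; the function name `sparse_relationship_matrix`, the `(father, mother)` input format, and the toy pedigree are assumptions made for illustration. Individuals are assumed to be indexed so that parents precede their children, with `-1` marking an unknown parent.

```python
from collections import defaultdict


def sparse_relationship_matrix(parents):
    """Return the additive relationship matrix as a list of {column: value} dicts.

    parents[i] is a (father, mother) pair of indices, or -1 for an unknown
    parent. Parents must appear before their children (hypothetical input
    convention chosen for this sketch).
    """
    n = len(parents)
    rows = [defaultdict(float) for _ in range(n)]
    for i, (father, mother) in enumerate(parents):
        for p in (father, mother):
            if p < 0:
                continue
            if p >= i:
                raise ValueError("parents must precede their children")
            # Relatedness to earlier individuals is the average over the
            # parents: A[i, j] = 0.5 * (A[father, j] + A[mother, j]).
            for j, value in list(rows[p].items()):
                rows[i][j] += 0.5 * value
        # Diagonal: 1 plus the inbreeding coefficient, 0.5 * A[father, mother].
        if father >= 0 and mother >= 0:
            rows[i][i] = 1.0 + 0.5 * rows[father].get(mother, 0.0)
        else:
            rows[i][i] = 1.0
        # Mirror the new row so that later individuals see a symmetric matrix.
        for j, value in list(rows[i].items()):
            if j != i:
                rows[j][i] = value
    return [dict(row) for row in rows]


# Toy pedigree: 0 and 1 are founders, 2 and 3 are their children,
# 4 is an unrelated founder, and 5 is the child of 2 and 4.
pedigree = [(-1, -1), (-1, -1), (0, 1), (0, 1), (-1, -1), (2, 4)]
A = sparse_relationship_matrix(pedigree)
stored = sum(len(row) for row in A)
print(f"{stored} stored entries out of {len(A) ** 2}")
```

Pairs with no common ancestor, such as individuals 3 and 4 above, are never stored. In a pedigree with millions of individuals most pairs are of this kind, which is the sparsity that makes the matrix tractable to build and factorize.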
