Sci-LMM is an efficient strategy for inferring genetic variance components using population scale family trees

The rapid digitization of genealogical and medical records enables the assembly of extremely large pedigree records spanning millions of individuals. Such pedigrees provide the opportunity to answer genetic and epidemiological questions at scales much larger than previously possible. Linear mixed models (LMMs) are often used to analyze pedigree data. However, LMMs do not naturally scale to pedigrees spanning millions of individuals, owing to their steep computational and storage requirements. Here we propose a novel modeling framework called Sparse Cholesky factorIzation LMM (SciLMM) that alleviates these difficulties by exploiting the sparsity patterns found in large pedigree data. The proposed framework can construct a matrix of genetic relationships between trillions of pairs of individuals in several hours, and can fit the corresponding LMM in several days. We demonstrate the capabilities of SciLMM via simulation studies and by estimating the heritability of longevity in a very large pedigree spanning millions of individuals and over five centuries of human history. The SciLMM framework enables analyses of extremely large pedigrees that were not previously possible. SciLMM is available at https://github.com/TalShor/SciLMM.
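To illustrate why pedigree sparsity helps, the following is a minimal sketch of building a sparse additive relationship matrix from parent pointers. It is not taken from the SciLMM code base: the function name `sparse_relationship_matrix`, the input format, and the toy pedigree are hypothetical. The sketch uses the standard factorization A = T D Tᵀ, with T = (I − P)⁻¹ and P holding 0.5 at each (child, parent) entry, and it assumes no inbreeding, so that each diagonal entry of D equals 1 minus 0.25 per known parent.

```python
import numpy as np
import scipy.sparse as sp


def sparse_relationship_matrix(parents):
    """Build a sparse additive relationship matrix from a pedigree.

    parents: list of (parent_a, parent_b) index pairs, one per individual,
             with -1 for an unknown parent. Parents must be listed before
             their children. Assumes no inbreeding (hypothetical helper,
             not the SciLMM API).
    """
    n = len(parents)
    rows, cols = [], []
    for child, pair in enumerate(parents):
        for parent in pair:
            if parent >= 0:
                if parent >= child:
                    raise ValueError("parents must precede their children")
                rows.append(child)
                cols.append(parent)

    # P[child, parent] = 0.5: each parent transmits half of its genome.
    P = sp.csr_matrix((np.full(len(rows), 0.5), (rows, cols)), shape=(n, n))

    # Mendelian sampling variance without inbreeding:
    # 1 for founders, 0.75 with one known parent, 0.5 with both known.
    known = np.bincount(np.array(rows, dtype=int), minlength=n)
    d = 1.0 - 0.25 * known

    # P is strictly lower triangular, hence nilpotent, so
    # (I - P)^-1 = I + P + P^2 + ... has finitely many nonzero terms
    # (one per generation of depth in the pedigree).
    T = sp.identity(n, format="csr")
    term = P.copy()
    while term.nnz > 0:
        T = T + term
        term = term @ P

    return (T @ sp.diags(d) @ T.T).tocsr()


# Toy pedigree: 0 and 1 are founders, 2 and 3 are their children,
# 4 is an unrelated founder, and 5 is the child of 2 and 4.
toy_parents = [(-1, -1), (-1, -1), (0, 1), (0, 1), (-1, -1), (2, 4)]
A = sparse_relationship_matrix(toy_parents)
print(A.toarray())
```

In this toy pedigree the full siblings 2 and 3 receive a coefficient of 0.5, the grandparent-grandchild pair (0, 5) receives 0.25, and unrelated founders receive 0, so those entries are never stored. In a pedigree of millions of individuals most pairs are unrelated in this sense, which is the kind of sparsity a sparse Cholesky factorization can exploit when fitting the LMM.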
