Estimating variance components in population-scale family trees

The rapid digitization of genealogical and medical records enables the assembly of extremely large pedigrees spanning millions of individuals and trillions of pairs of relatives. Such pedigrees provide the opportunity to investigate the sociological and epidemiological history of human populations at scales much larger than previously possible. Linear mixed models (LMMs) are routinely used to analyze extremely large animal and plant pedigrees for the purposes of selective breeding. However, LMMs have not previously been applied to population-scale human family trees. Here, we present Sparse Cholesky factorIzation LMM (Sci-LMM), a modeling framework for studying population-scale family trees that combines techniques from the animal and plant breeding literature with techniques from the human genetics literature. The proposed framework can construct a matrix of relationships between trillions of pairs of individuals and fit the corresponding LMM in several hours. We demonstrate the capabilities of Sci-LMM via simulation studies and by estimating the heritability of longevity and of reproductive fitness (quantified via number of children) in a large pedigree spanning millions of individuals and over five centuries of human history. Sci-LMM provides a unified framework for investigating the epidemiological history of human populations via genealogical records.
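To give a sense of why relationship matrices over very large pedigrees can remain tractable, the following is a minimal sketch of a classical technique from the animal breeding literature: assembling the inverse of the additive relationship matrix directly from parent-child links, so that only individuals connected through a parent-child or mate relationship receive non-zero entries. It is not the Sci-LMM implementation. The function name `sparse_inverse_relationship`, the `(child, father, mother)` tuple layout, and the dictionary storage are assumptions made for illustration, and the sketch assumes no inbreeding.

```python
from collections import defaultdict


def sparse_inverse_relationship(pedigree):
    """Build the inverse additive relationship matrix as a sparse dict.

    pedigree: iterable of (child, father, mother) tuples; a parent is None
    when unknown. Hypothetical layout, chosen for this sketch only.
    Assumes no inbreeding, so the Mendelian sampling variance of each
    individual depends only on how many of its parents are known.
    Returns {(row, col): value} holding only the non-zero entries.
    """
    a_inv = defaultdict(float)
    for child, father, mother in pedigree:
        parents = [p for p in (father, mother) if p is not None]
        # Sampling variance: 1 (no parents), 3/4 (one), 1/2 (both known).
        d = 1.0 - 0.25 * len(parents)
        alpha = 1.0 / d
        a_inv[(child, child)] += alpha
        for p in parents:
            # Child-parent entries.
            a_inv[(child, p)] -= alpha / 2
            a_inv[(p, child)] -= alpha / 2
            # Parent-parent entries (including each parent's diagonal).
            for q in parents:
                a_inv[(p, q)] += alpha / 4
    return dict(a_inv)


# Two unrelated nuclear families: founders 0-3, children 4 and 5.
example_pedigree = [
    (0, None, None), (1, None, None), (2, None, None), (3, None, None),
    (4, 0, 1),
    (5, 2, 3),
]
example_a_inv = sparse_inverse_relationship(example_pedigree)
n_nonzero = len(example_a_inv)  # stored entries out of 6 * 6 possible
```

In this toy pedigree, individuals from different families never share an entry, so the number of stored values grows with the number of parent-child links rather than with the number of pairs of individuals. A sparse matrix of this kind is the sort of input that a sparse Cholesky factorization can exploit.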
