Fast and flexible linear mixed models for genome-wide genetics

Linear mixed effect models are powerful tools used to account for population structure in genome-wide association studies (GWASs) and to estimate the genetic architecture of complex traits. However, fully specified models are computationally demanding, and common simplifications often lead to reduced power or biased inference. We describe Grid-LMM (https://github.com/deruncie/GridLMM), an extendable algorithm for repeatedly fitting complex linear models that account for multiple sources of heterogeneity, such as additive and non-additive genetic variance, spatial heterogeneity, and genotype-environment interactions. Grid-LMM can compute approximate (yet highly accurate) frequentist test statistics or Bayesian posterior summaries at a genome-wide scale in a fraction of the time required by existing general-purpose methods. We apply Grid-LMM to two types of quantitative genetic analyses. The first accounts for spatial variability and non-additive genetic variance while scanning for QTL, and the second aims to identify gene expression traits affected by non-additive genetic variation. In both cases, modeling multiple sources of heterogeneity leads to new discoveries.

Author summary

The goal of quantitative genetics is to characterize the relationship between genetic variation and variation in quantitative traits such as height, productivity, or disease susceptibility. A statistical method known as the linear mixed effect model has been critical to the development of quantitative genetics. First applied to animal breeding, this model now forms the basis of a wide range of modern genomic analyses, including genome-wide associations, polygenic modeling, and genomic prediction. The same model is also widely used in ecology, evolutionary genetics, social sciences, and many other fields. Mixed models are frequently multi-faceted, which is necessary for accurately modeling data generated from complex experimental designs. However, most genomic applications use only the simplest form of linear mixed models because the computational demands of model fitting can be too great. We develop a flexible approach for fitting linear mixed models to genome-scale data that greatly reduces their computational burden and lets users choose the best statistical paradigm for their data analysis. Using both simulated and real data examples, we demonstrate improved accuracy for genetic association tests, increased power to discover causal genetic variants, and the ability to provide accurate summaries of model uncertainty.
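The abstract does not spell out the algorithm, so the following is only an illustrative sketch of one way a linear mixed model can be fitted repeatedly at low cost: evaluate the likelihood on a fixed grid of variance-component proportions rather than optimizing them for each test. It assumes a single random effect whose covariance matrix has already been eigendecomposed, with the response and one covariate rotated by its eigenvectors. The function names `profile_loglik` and `grid_fit`, the parameter `h2`, and the grid size are hypothetical and are not taken from the GridLMM package.

```python
import math

def profile_loglik(h2, d, y_rot, x_rot):
    """ML log-likelihood at a fixed variance proportion h2.

    Assumed model after rotation: y_i ~ N(x_i * beta, sigma2 * (h2 * d_i + 1 - h2)),
    where d_i are eigenvalues of the random-effect covariance matrix.
    beta and sigma2 are profiled out in closed form.
    """
    n = len(y_rot)
    # Inverse of the per-observation variance factor h2 * d_i + (1 - h2).
    w = [1.0 / (h2 * di + (1.0 - h2)) for di in d]
    xwx = sum(wi * xi * xi for wi, xi in zip(w, x_rot))
    xwy = sum(wi * xi * yi for wi, xi, yi in zip(w, x_rot, y_rot))
    beta = xwy / xwx  # weighted least-squares estimate
    rss = sum(wi * (yi - beta * xi) ** 2 for wi, xi, yi in zip(w, x_rot, y_rot))
    sigma2 = rss / n  # ML estimate of the overall scale
    logdet = -sum(math.log(wi) for wi in w)  # log-determinant of the diagonal covariance factor
    loglik = -0.5 * (n * math.log(2.0 * math.pi * sigma2) + logdet + n)
    return loglik, beta, sigma2

def grid_fit(d, y_rot, x_rot, n_grid=20):
    """Evaluate the profile likelihood on the grid h2 = 0, 1/n_grid, ..., (n_grid-1)/n_grid
    and return the grid point with the highest log-likelihood."""
    best = None
    for i in range(n_grid):
        h2 = i / n_grid  # stays below 1, so every variance factor is positive when d_i >= 0
        loglik, beta, sigma2 = profile_loglik(h2, d, y_rot, x_rot)
        if best is None or loglik > best["loglik"]:
            best = {"h2": h2, "loglik": loglik, "beta": beta, "sigma2": sigma2}
    return best

if __name__ == "__main__":
    # Toy inputs, invented purely for illustration.
    d = [3.0, 1.5, 0.8, 0.4, 0.2, 0.1]
    y_rot = [2.1, -0.3, 1.2, 0.4, 0.9, 1.1]
    x_rot = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
    print(grid_fit(d, y_rot, x_rot))
```

Because the per-observation weights depend only on the grid point and not on the marker being tested, a genome-wide scan under this sketch could compute them once per grid point and reuse them across all markers. Whether and how Grid-LMM exploits this is described in the full paper, not here.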
