Computationally efficient whole-genome regression for quantitative and binary traits

Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model that is orders of magnitude faster than alternatives while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. The method is applicable to both quantitative and binary phenotypes, including rare-variant analysis of binary traits with unbalanced case-control ratios, for which we introduce a fast, approximate Firth logistic regression test. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach compared to several existing methods using quantitative and binary traits from the UK Biobank dataset with up to 407,746 individuals.
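
The memory and multi-phenotype claims above can be illustrated with a small sketch. It is not the REGENIE implementation: it assumes ridge regression as the per-block model, and the function name `blockwise_ridge_predictions`, the `load_block` callback, the penalty `lam`, and the simulated data are all hypothetical choices made for illustration. The point it shows is that each genotype block can be processed on its own, and that one linear solve per block serves every phenotype column at once.

```python
import numpy as np

def blockwise_ridge_predictions(load_block, n_blocks, Y, lam=1.0):
    """Fit a ridge regression per genotype block for all phenotypes at once.

    load_block(b) is a hypothetical callback returning block b of the
    genotype matrix with shape (n_samples, block_size), so only one block
    needs to be held in memory at a time.
    Y has shape (n_samples, n_phenotypes).
    """
    preds = []
    for b in range(n_blocks):
        G = load_block(b)
        G = G - G.mean(axis=0)  # centre each variant within the block
        A = G.T @ G + lam * np.eye(G.shape[1])
        # One solve handles every phenotype column of Y simultaneously.
        beta = np.linalg.solve(A, G.T @ Y)  # (block_size, n_phenotypes)
        preds.append(G @ beta)
    return np.stack(preds, axis=0)  # (n_blocks, n_samples, n_phenotypes)

# Simulated data (illustrative only): 200 samples, 60 variants, 3 phenotypes.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(200, 60)).astype(float)
Y = rng.normal(size=(200, 3))

def load_block(b, block_size=20):
    # Stand-in for reading one segment of a genotype file from disk.
    return genotypes[:, b * block_size:(b + 1) * block_size]

preds = blockwise_ridge_predictions(load_block, n_blocks=3, Y=Y)
```

Because the blocks are independent, the loop body could also be dispatched to separate workers, which is one way to read the abstract's remark about distributed computing frameworks.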
