Bayesian multiple logistic regression for case-control GWAS

Genetic variants in genome-wide association studies (GWAS) are tested for disease association mostly using simple regression, one variant at a time. Standard approaches to improve power in detecting disease-associated SNPs use multiple regression with Bayesian variable selection, in which a sparsity-enforcing prior on effect sizes is used to avoid overfitting and all effect sizes are integrated out for posterior inference. For binary traits, the logistic model has not yielded clear improvements over the linear model. For multi-SNP analysis, the logistic model has required costly and technically challenging MCMC sampling to perform the integration. Here, we introduce the quasi-Laplace approximation to solve the integral and avoid MCMC sampling. We expect the logistic model to perform much better than multiple linear regression except when predicted disease risks are spread closely around 0.5, because the logistic function can be well approximated by a linear function only close to its inflection point. Indeed, in extensive benchmarks with simulated phenotypes and real genotypes, our Bayesian multiple LOgistic REgression method (B-LORE) showed considerable improvements (1) when regressing on many variants in multiple loci at heritabilities ≥ 0.4 and (2) for unbalanced case-control ratios. B-LORE also enables meta-analysis by approximating the likelihood functions of individual studies by multivariate normal distributions, using their means and covariance matrices as summary statistics. Our work should also make sparse multiple logistic regression attractive for other applications with binary target variables. B-LORE is freely available at https://github.com/soedinglab/b-lore.
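
The claim that a linear model can stand in for the logistic model only near predicted risks of 0.5 can be checked numerically. The following is a minimal sketch, not part of B-LORE: it compares the logistic function with its first-order Taylor expansion at the inflection point (value 0.5, slope 0.25). The function names and the grid of linear-predictor values are illustrative assumptions.

```python
import math


def logistic(x: float) -> float:
    """Logistic function mapping a linear predictor x to a risk in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def linear_at_inflection(x: float) -> float:
    """First-order Taylor expansion of the logistic function around x = 0.

    At the inflection point the function value is 0.5 and the slope is 0.25.
    """
    return 0.5 + 0.25 * x


def approximation_error(x: float) -> float:
    """Absolute gap between the logistic function and its linear stand-in."""
    return abs(logistic(x) - linear_at_inflection(x))


if __name__ == "__main__":
    # Illustrative grid of linear-predictor values (an assumption, not from the paper).
    for x in (0.0, 0.5, 1.0, 2.0, 3.0):
        print(
            f"x={x:>4.1f}  risk={logistic(x):.3f}  "
            f"linear={linear_at_inflection(x):.3f}  "
            f"error={approximation_error(x):.3f}"
        )
```

On this grid the gap is negligible while the risk stays near 0.5 and grows quickly as the risk approaches 0 or 1; beyond x = 2 the linear stand-in even exceeds 1, which is not a valid probability.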
