Bayesian variable selection regression for genome-wide association studies and other large-scale problems

We consider applying Bayesian Variable Selection Regression, or BVSR, to genome-wide association studies and similar large-scale regression problems. Typical genome-wide association studies currently measure hundreds of thousands, or millions, of genetic variants (SNPs) in thousands or tens of thousands of individuals, and attempt to identify regions harboring SNPs that affect some phenotype or outcome of interest. This goal can naturally be cast as a variable selection regression problem, with the SNPs as the covariates in the regression. Characteristic features of genome-wide association studies include the following: (i) the focus is primarily on identifying relevant variables, rather than on prediction; and (ii) many relevant covariates may have tiny effects, making it effectively impossible to confidently identify the complete "correct" subset of variables. Taken together, these features put a premium on having interpretable measures of confidence that individual covariates are included in the model, which we argue is a strength of BVSR compared with alternatives such as penalized regression methods. Here we focus primarily on the analysis of quantitative phenotypes, and on appropriate prior specification for BVSR in this setting, emphasizing the idea of considering what the priors imply about the total proportion of variance in the outcome explained by relevant covariates. We also emphasize the potential for BVSR to estimate this proportion of variance explained, and hence to shed light on the issue of "missing heritability" in genome-wide association studies.
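To make concrete the idea of asking what a prior implies about the proportion of variance explained, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: it assumes a simple spike-and-slab prior in which each SNP is included with probability `pi` and included effects are drawn from a normal distribution with standard deviation `sigma_a`, it fixes the residual variance at 1, and it uses simulated genotypes. The function name `implied_pve` and all parameter values are hypothetical.

```python
import numpy as np


def implied_pve(X, pi, sigma_a, n_draws=200, seed=0):
    """Draw effect vectors from an assumed spike-and-slab prior and return
    the proportion of variance explained (PVE) implied by each draw.

    Assumption: residual variance is fixed at 1, so sigma_a is the effect
    size standard deviation in units of the residual standard deviation.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)  # center each covariate
    pves = np.empty(n_draws)
    for d in range(n_draws):
        included = rng.random(p) < pi                # spike: which SNPs are in the model
        effects = rng.normal(0.0, sigma_a, size=p)   # slab: effect sizes if included
        beta = np.where(included, effects, 0.0)
        signal_var = np.var(Xc @ beta)               # variance of the genetic component
        pves[d] = signal_var / (signal_var + 1.0)    # PVE with residual variance 1
    return pves


# Simulated genotypes (0/1/2 minor-allele counts), purely illustrative.
rng = np.random.default_rng(1)
X = rng.binomial(2, 0.3, size=(500, 1000)).astype(float)

# Same effect-size prior, two different sparsity levels.
sparse = implied_pve(X, pi=0.005, sigma_a=0.2)
dense = implied_pve(X, pi=0.05, sigma_a=0.2)
print("median implied PVE, pi=0.005:", np.median(sparse))
print("median implied PVE, pi=0.05: ", np.median(dense))
```

With the effect-size scale held fixed, changing only the inclusion probability shifts the implied proportion of variance explained substantially, which is one reason to examine the prior on that scale rather than on each hyperparameter separately.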
