Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies

Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study, to identify the subset that best predicts disease outcome, is now feasible thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Each regression coefficient was assigned a prior with a sharp mode at zero, and posterior mode estimates of the coefficients were obtained. A SNP with a non-zero coefficient estimate was interpreted as significant. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for the type-I error that avoids the need for permutation procedures. As well as genome-wide analyses, our method is well suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It accommodates quantitative as well as case-control phenotypes and covariate adjustment, and it can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500 K SNPs, a real genome-wide data set of 300 K SNPs, and a sequence-based data set, each of which can be analysed in a few hours on a desktop workstation.
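The idea of a posterior mode under a prior with a sharp mode at zero can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Laplace (double-exponential) prior, which is equivalent to an L1 penalty, rather than the normal-exponential-gamma prior named above, and it uses a plain proximal-gradient optimiser on a small simulated data set. The function names, the penalty value `lam`, the step size, and the simulation settings are all hypothetical choices made for illustration.

```python
import math
import random

def encode_genotype(g):
    """Map a minor-allele count g in {0, 1, 2} to (additive, dominant, recessive) codes."""
    return (float(g), 1.0 if g >= 1 else 0.0, 1.0 if g == 2 else 0.0)

def soft_threshold(z, t):
    """Shrink z towards zero by t; values with |z| <= t become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit_l1_logistic(X, y, lam, step=0.1, n_iter=200):
    """Posterior mode of a logistic regression under independent Laplace priors
    (an L1 penalty of weight lam), found by proximal gradient descent.
    The intercept is left unpenalised."""
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Fitted probability minus outcome for every individual.
        resid = []
        for i in range(n):
            eta = beta0 + sum(b * x for b, x in zip(beta, X[i]))
            resid.append(1.0 / (1.0 + math.exp(-eta)) - y[i])
        beta0 -= step * sum(resid) / n
        for j in range(p):
            grad = sum(resid[i] * X[i][j] for i in range(n)) / n
            # Gradient step on the likelihood, then shrinkage from the prior.
            beta[j] = soft_threshold(beta[j] - step * grad, step * lam)
    return beta0, beta

# Simulated case-control data (assumed settings): 200 people, 8 SNPs,
# minor-allele frequency 0.3, with SNP 0 acting additively on the log-odds.
random.seed(1)
n_people, n_snps, maf = 200, 8, 0.3
genos = [[(random.random() < maf) + (random.random() < maf) for _ in range(n_snps)]
         for _ in range(n_people)]
X = [[code for g in row for code in encode_genotype(g)] for row in genos]
y = [1.0 if random.random() < 1.0 / (1.0 + math.exp(1.0 - 1.2 * row[0])) else 0.0
     for row in genos]

beta0, beta = fit_l1_logistic(X, y, lam=0.05)
# Column j corresponds to SNP j // 3 and to code j % 3
# (0 = additive, 1 = dominant, 2 = recessive).
selected = [(j // 3, j % 3) for j, b in enumerate(beta) if b != 0.0]
print(selected)
```

In this sketch a column counts as selected when its coefficient is not exactly zero, mirroring the interpretation of non-zero estimates above. A larger `lam` corresponds to a prior more sharply peaked at zero and selects fewer columns; a sufficiently large value leaves every coefficient at zero.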
