Stochastic model search with binary outcomes for genome-wide association studies

Objective: The spread of case–control genome-wide association studies (GWASs) has stimulated the development of new variable selection methods and predictive models. We introduce a novel Bayesian model search algorithm, Binary Outcome Stochastic Search (BOSS), which addresses the model selection problem when the number of predictors far exceeds the number of binary responses.

Materials and methods: Our method is based on a latent variable model that links the observed binary outcomes to the underlying genetic variables. A Markov chain Monte Carlo approach is used both to search the model space and to estimate the posterior probability of each predictor.

Results: BOSS is compared with three established methods (stepwise regression, logistic lasso, and elastic net) on a simulated benchmark. Two real case studies are also investigated: a GWAS on the genetic bases of longevity, and the type 2 diabetes study from the Wellcome Trust Case Control Consortium. In the simulations, BOSS achieves higher precision than the reference methods while preserving good recall. In both experimental studies, BOSS detects genetic polymorphisms previously reported to be associated with the analyzed phenotypes.

Discussion: BOSS outperforms the other methods in terms of F-measure on simulated data. In the two real studies, BOSS detects biologically relevant features, some of which are missed by univariate analysis and by the three reference techniques.

Conclusion: The proposed algorithm is an advance in the methodology for model selection with a large number of features. Our simulated and experimental results show that BOSS is effective in detecting relevant markers while providing a parsimonious model.
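The abstract describes the approach only at a high level: a latent variable model for the binary outcomes, explored by Markov chain Monte Carlo to obtain a posterior probability for each predictor. The sketch below illustrates that general idea and is not the BOSS algorithm itself. It assumes a probit link with data augmentation, an independent N(0, tau2) prior on the included coefficients, a fixed prior inclusion probability, and a single add/delete Metropolis move per iteration. All function and parameter names (`sample_latent`, `log_marginal`, `stochastic_search`, `tau2`, `prior_incl`) are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for inverse-CDF sampling


def sample_latent(mean, y, rng):
    """Draw z_i ~ N(mean_i, 1), truncated to z_i > 0 if y_i == 1, else z_i <= 0."""
    z = np.empty(len(mean))
    for i, (m, yi) in enumerate(zip(mean, y)):
        p0 = _STD.cdf(-m)  # P(z_i <= 0) before truncation
        lo, hi = (p0, 1.0) if yi == 1 else (0.0, p0)
        # Clipping keeps inv_cdf inside (0, 1); it is crude in the extreme tails.
        u = min(max(rng.uniform(lo, hi), 1e-12), 1.0 - 1e-12)
        z[i] = m + _STD.inv_cdf(u)
    return z


def log_marginal(z, X, idx, tau2):
    """log p(z | model) up to a constant, with the coefficients integrated out.

    Under beta ~ N(0, tau2 * I), z ~ N(0, I + tau2 * Xg Xg^T); the Woodbury
    identity keeps the linear algebra at the size of the model, not of n.
    """
    if len(idx) == 0:
        return -0.5 * (z @ z)
    Xg = X[:, idx]
    A = Xg.T @ Xg + np.eye(len(idx)) / tau2
    b = Xg.T @ z
    _, logdet = np.linalg.slogdet(tau2 * A)  # equals logdet(I + tau2 * Xg Xg^T)
    return -0.5 * logdet - 0.5 * (z @ z - b @ np.linalg.solve(A, b))


def stochastic_search(X, y, n_iter=2000, burn_in=500, tau2=1.0,
                      prior_incl=0.05, seed=0):
    """Return estimated posterior inclusion probabilities, one per column of X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    gamma = np.zeros(p, dtype=bool)      # start from the empty model
    z = np.where(y == 1, 0.5, -0.5)      # latent values consistent with y
    counts = np.zeros(p)
    log_odds = np.log(prior_incl / (1.0 - prior_incl))
    cur = log_marginal(z, X, np.flatnonzero(gamma), tau2)
    for it in range(n_iter):
        # 1) Model move: flip one indicator, accept by Metropolis-Hastings.
        j = rng.integers(p)
        prop = gamma.copy()
        prop[j] = not prop[j]
        new = log_marginal(z, X, np.flatnonzero(prop), tau2)
        log_ratio = new - cur + (log_odds if prop[j] else -log_odds)
        if np.log(rng.random()) < log_ratio:
            gamma = prop
        # 2) Draw the coefficients of the current model given z.
        idx = np.flatnonzero(gamma)
        mean = np.zeros(n)
        if idx.size:
            Xg = X[:, idx]
            A = Xg.T @ Xg + np.eye(idx.size) / tau2
            L = np.linalg.cholesky(A)
            mu = np.linalg.solve(A, Xg.T @ z)
            beta = mu + np.linalg.solve(L.T, rng.standard_normal(idx.size))
            mean = Xg @ beta
        # 3) Refresh the latent variables given the coefficients and outcomes.
        z = sample_latent(mean, y, rng)
        cur = log_marginal(z, X, idx, tau2)
        if it >= burn_in:
            counts += gamma
    return counts / (n_iter - burn_in)
```

Calling `stochastic_search(X, y)` with an `n × p` genotype matrix `X` and a 0/1 outcome vector `y` returns one inclusion frequency per predictor. A single-flip proposal is the simplest possible move and would mix slowly at genome-wide scale; it is used here only to keep the example short.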
