A novel variational Bayes multiple locus Z-statistic for genome-wide association studies with Bayesian model averaging

MOTIVATION For many complex traits, including height, the majority of variants identified by genome-wide association studies (GWAS) have small effects, leaving a substantial proportion of the heritable variation unexplained. Although many penalized multiple regression methodologies have been proposed to increase the power to detect associations for complex genetic architectures, they generally lack mechanisms for false-positive control and diagnostics for model over-fitting. Our methodology is the first penalized multiple regression approach that explicitly controls Type I error rates and provides model over-fitting diagnostics. It does so through a novel normally distributed statistic, defined for every marker within the GWAS and based on results from a variational Bayes spike regression algorithm.

RESULTS We compare the performance of our method to the lasso and single marker analysis on simulated data and demonstrate that our approach has superior performance in terms of power and Type I error control. In addition, using the Women's Health Initiative (WHI) SNP Health Association Resource (SHARe) GWAS of African-Americans, we show that our method has power to detect additional novel associations with body height. These findings replicate in a larger cohort, where they reach a stringent cutoff for marginal association.

AVAILABILITY An R-package, including an implementation of our variational Bayes spike regression (vBsr) algorithm, is available at http://kooperberg.fhcrc.org/soft.html.
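For orientation, the single marker analysis used as a comparison baseline can be sketched as follows. This is not the vBsr algorithm and not the code in the R-package. It is a minimal illustration, assuming a quantitative trait, an additive 0/1/2 genotype coding, simple linear regression per marker, and a Bonferroni correction as the Type I error control. The function names, the simulated data, and every parameter value are hypothetical.

```python
import math
import random
from statistics import NormalDist


def single_marker_z(genotypes, phenotype):
    """Slope / standard error from regressing the phenotype on one marker.

    Strictly a t-statistic with n - 2 degrees of freedom; for GWAS-sized
    samples it is approximately standard normal under the null.
    """
    n = len(phenotype)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        return 0.0  # monomorphic marker carries no information
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotype))
    beta = sxy / sxx
    rss = sum(
        (y - mean_y - beta * (g - mean_g)) ** 2
        for g, y in zip(genotypes, phenotype)
    )
    if rss == 0:
        return math.copysign(math.inf, beta)  # perfect fit, degenerate case
    sigma2 = rss / (n - 2)
    return beta / math.sqrt(sigma2 / sxx)


def bonferroni_z_threshold(alpha, n_tests):
    """Two-sided |Z| cutoff keeping the family-wise error rate at alpha."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_tests))


# Hypothetical simulated data: 200 markers, only marker 0 affects the trait.
rng = random.Random(0)
n_samples, n_markers = 500, 200
genotypes = [
    [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n_samples)]
    for _ in range(n_markers)
]
phenotype = [0.8 * genotypes[0][i] + rng.gauss(0.0, 1.0) for i in range(n_samples)]

z_scores = [single_marker_z(g, phenotype) for g in genotypes]
threshold = bonferroni_z_threshold(0.05, n_markers)
hits = [j for j, z in enumerate(z_scores) if abs(z) > threshold]
print("markers passing the Bonferroni cutoff:", hits)
```

Each marker is tested in isolation here, so the effects of other markers are absorbed into the residual variance. A multiple-locus approach instead models all markers jointly while still reporting a per-marker statistic.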
