The Bayesian lasso for genome-wide association studies

MOTIVATION Despite their success in identifying genes that affect complex diseases or traits, current genome-wide association studies (GWASs) based on single-SNP analysis are too simple to provide a comprehensive picture of the genetic architecture of phenotypes. A simultaneous analysis of a large number of SNPs, although statistically challenging, especially with a small number of samples, is crucial for genetic modeling.

METHOD We propose a two-stage procedure for multi-SNP modeling and analysis in GWASs: first, a 'preconditioned' response variable is produced using supervised principal component analysis; second, a Bayesian lasso is formulated to select a subset of significant SNPs. The Bayesian lasso is implemented with a hierarchical model, in which scale mixtures of normals are used as prior distributions for the genetic effects and exponential priors are placed on their variances, and is fitted using a Markov chain Monte Carlo (MCMC) algorithm. Our approach obviates the choice of the lasso parameter by imposing a diffuse hyperprior on it and estimating it along with the other parameters. It is particularly powerful for selecting the most relevant SNPs in GWASs, where the number of predictors exceeds the number of observations.

RESULTS The new approach was examined through a simulation study. Applying it to a real dataset from the Framingham Heart Study, we detected several significant genes associated with body mass index (BMI). Our findings support previous results on BMI-related SNPs and provide new insights into the genetic control of this trait.

AVAILABILITY The computer code for the approach is available at the Penn State Center for Statistical Genetics website, http://statgen.psu.edu.
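The hierarchical model described under METHOD can be illustrated with a small Gibbs sampler. The sketch below is not the authors' code. It assumes the standard Bayesian lasso hierarchy of Park and Casella: normal priors on the effects with variances sigma2 * tau_j^2, exponential priors on tau_j^2, and a gamma hyperprior on lambda^2. The function name `bayesian_lasso_gibbs`, the hyperparameters `r` and `delta`, and the default iteration counts are hypothetical choices made for illustration, and the preconditioning stage is omitted.

```python
import numpy as np


def bayesian_lasso_gibbs(X, y, n_iter=2000, burn_in=500, r=1.0, delta=0.1, seed=0):
    """Gibbs sampler for a Bayesian lasso (illustrative sketch, assumed hierarchy).

    Assumed model: y ~ N(X beta, sigma2 I), beta_j ~ N(0, sigma2 * tau_j^2),
    tau_j^2 ~ Exp(lambda^2 / 2), lambda^2 ~ Gamma(shape=r, rate=delta),
    and an improper prior proportional to 1/sigma2 on sigma2.
    Returns posterior draws of beta and of lambda^2 after burn-in.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    y = y - y.mean()                      # centre the response (intercept integrated out)
    XtX, Xty = X.T @ X, X.T @ y

    sigma2, lam2 = float(np.var(y)), 1.0  # starting values
    inv_tau2 = np.ones(p)
    beta_draws = np.empty((n_iter - burn_in, p))
    lam2_draws = np.empty(n_iter - burn_in)

    for it in range(n_iter):
        # beta | rest ~ N(A^{-1} X'y, sigma2 * A^{-1}), with A = X'X + diag(1/tau^2)
        A = XtX + np.diag(inv_tau2)
        L = np.linalg.cholesky(A)
        mean = np.linalg.solve(A, Xty)
        z = rng.standard_normal(p)
        beta = mean + np.sqrt(sigma2) * np.linalg.solve(L.T, z)

        # sigma2 | rest ~ Inverse-Gamma(shape, scale), drawn as scale / Gamma(shape, 1)
        resid = y - X @ beta
        shape = (n - 1 + p) / 2.0
        scale = (resid @ resid + beta @ (inv_tau2 * beta)) / 2.0
        sigma2 = scale / rng.gamma(shape)

        # 1/tau_j^2 | rest ~ Inverse-Gaussian(mean=sqrt(lam2*sigma2/beta_j^2), shape=lam2)
        mu = np.sqrt(lam2 * sigma2 / np.maximum(beta ** 2, 1e-12))
        inv_tau2 = np.maximum(rng.wald(mu, lam2), 1e-12)

        # lambda^2 | rest ~ Gamma(shape=p + r, rate=sum(tau_j^2)/2 + delta)
        lam2 = rng.gamma(p + r, 1.0 / (np.sum(1.0 / inv_tau2) / 2.0 + delta))

        if it >= burn_in:
            beta_draws[it - burn_in] = beta
            lam2_draws[it - burn_in] = lam2

    return beta_draws, lam2_draws


if __name__ == "__main__":
    # Simulated genotype-like design with two true effects (made-up data).
    rng = np.random.default_rng(1)
    X = rng.standard_normal((60, 30))
    beta_true = np.zeros(30)
    beta_true[[0, 1]] = [3.0, -3.0]
    y = X @ beta_true + rng.standard_normal(60)
    beta_draws, lam2_draws = bayesian_lasso_gibbs(X, y, n_iter=600, burn_in=200)
    print(np.round(beta_draws.mean(axis=0)[:5], 2))
```

Because lambda^2 is sampled in each sweep rather than fixed, no cross-validation over the lasso parameter is needed, which matches the diffuse-hyperprior idea described under METHOD. The matrix A stays invertible even when the number of predictors exceeds the number of observations, since the diagonal prior term is strictly positive.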
