Incorporating grouping information in Bayesian variable selection with applications in genomics

In many applications it is of interest to determine a limited number of important explanatory factors (representing groups of potentially overlapping predictors) rather than original predictor variables. The often imposed requirement that the clustered predictors enter the model simultaneously can be limiting, since not all variables within a group need be associated with the outcome; within-group sparsity is often desirable as well. Here we propose a Bayesian variable selection method that uses the grouping information as a means of introducing more equal competition to enter the model within the groups rather than as a source of strict regularization constraints. This is achieved within the context of the Bayesian LASSO (least absolute shrinkage and selection operator) by allowing each regression coefficient to be penalized differentially and by considering an additional regression layer that relates the individual penalty parameters to a group identification matrix. The proposed hierarchical model therefore enables simultaneous inference on two levels: (1) the regression layer for the continuous outcome in relation to the predictors and (2) the regression layer for the penalty parameters in relation to the grouping information. Both overlapping and non-overlapping groups can be accommodated. The method does not assume within-group homogeneity of the regression coefficients, which is implicit in many structured penalized likelihood approaches; smoothness is here enforced at the penalty level rather than within the regression coefficients. To enhance the practicality of the proposed method we develop two rapid computational procedures based on the expectation-maximization (EM) algorithm, which offer substantial time savings in applications where high dimensionality renders Markov chain Monte Carlo (MCMC) approaches less practical. We demonstrate the usefulness of our method in predicting time to death in glioblastoma patients using pathways of genes.
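As a rough illustration of the two-layer structure described above, a minimal hierarchical sketch might read as follows; the symbols (y, X, beta_j, lambda_j, Z, gamma) and the log-linear link between the penalties and the group memberships are illustrative assumptions, not the paper's exact specification.

% Minimal sketch of a differentially penalized Bayesian LASSO with a
% second regression layer for the penalty parameters (illustrative notation only;
% the exact prior and link used in the paper may differ).
\begin{align*}
  \mathbf{y} \mid \boldsymbol{\beta}, \sigma^{2}
    &\sim \mathcal{N}\!\bigl(\mathbf{X}\boldsymbol{\beta},\, \sigma^{2}\mathbf{I}_{n}\bigr)
    && \text{(outcome regression layer)}\\[2pt]
  p(\beta_{j} \mid \lambda_{j})
    &= \tfrac{\lambda_{j}}{2}\exp\!\bigl(-\lambda_{j}\lvert\beta_{j}\rvert\bigr),
    \qquad j = 1,\dots,p
    && \text{(coefficient-specific Laplace prior)}\\[2pt]
  \log \lambda_{j}
    &= \mathbf{z}_{j}^{\top}\boldsymbol{\gamma}
    && \text{(penalty regression layer; } \mathbf{z}_{j}\text{ is the }j\text{-th row of the group matrix }\mathbf{Z}\text{)}
\end{align*}

Under a formulation of this kind, coefficients that share group memberships borrow strength through their penalties $\lambda_{j}$ rather than through their values, so within-group sparsity is preserved, and a predictor belonging to several (overlapping) groups simply has several nonzero entries in $\mathbf{z}_{j}$.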
