Integrating biological knowledge into variable selection: an empirical Bayes approach with an application in cancer biology

Background: An important question in the analysis of biochemical data is that of identifying subsets of molecular variables that may jointly influence a biological response. Statistical variable selection methods have been widely used for this purpose. In many settings, it may be important to incorporate ancillary biological information concerning the variables of interest. Pathway and network maps are one example of a source of such information. However, although ancillary information is increasingly available, it is not always clear how it should be used or how it should be weighted in relation to primary data.

Results: We put forward an approach in which biological knowledge is incorporated using informative prior distributions over variable subsets, with prior information selected and weighted in an automated, objective manner using an empirical Bayes formulation. We employ continuous, linear models with interaction terms and exploit biochemically-motivated sparsity constraints to permit exact inference. We show an example of priors for pathway- and network-based information and illustrate our proposed method on both synthetic response data and cancer drug response data. Comparisons are also made to alternative Bayesian and frequentist penalised-likelihood methods for incorporating network-based information.

Conclusions: The empirical Bayes method proposed here can aid prior elicitation for Bayesian variable selection studies and help to guard against mis-specification of priors. Empirical Bayes, together with the proposed pathway-based priors, results in an approach with competitive variable selection performance. In addition, the overall procedure is fast, deterministic, and has very few user-set parameters, yet is capable of capturing interplay between molecular players. The approach presented is general and readily applicable in any setting with multiple sources of biological prior knowledge.
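
The general idea of weighting a prior over variable subsets by empirical Bayes can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several ingredients are assumptions made only for illustration: a Zellner g-prior marginal likelihood with g = n, a hypothetical pathway score (`prior_score`) that penalises selected pairs of variables lying in different pathways, a prior of the form P(subset | lambda) proportional to exp(lambda * score), a simple grid search over lambda, and no interaction terms. A cap on subset size (`d_max`) stands in for the sparsity constraint, so that all subsets can be enumerated exactly.

```python
import itertools
import numpy as np


def logsumexp(a):
    """Numerically stable log(sum(exp(a)))."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))


def log_marginal(X, y, subset, g):
    """Log Bayes factor of a subset model against the intercept-only model
    under Zellner's g-prior (an assumed choice, not taken from the abstract)."""
    n = len(y)
    yc = y - y.mean()
    k = len(subset)
    if k == 0:
        r2 = 0.0
    else:
        Xs = X[:, subset] - X[:, subset].mean(axis=0)
        beta, *_ = np.linalg.lstsq(Xs, yc, rcond=None)
        resid = yc - Xs @ beta
        r2 = 1.0 - (resid @ resid) / (yc @ yc)
    return 0.5 * (n - 1 - k) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2))


def prior_score(subset, pathway):
    """Hypothetical pathway score: minus the number of selected pairs
    whose members belong to different pathways."""
    return -float(sum(pathway[i] != pathway[j]
                      for i, j in itertools.combinations(subset, 2)))


def empirical_bayes_selection(X, y, pathway, d_max=2, lambdas=None, g=None):
    """Enumerate all subsets of size <= d_max, pick the prior strength lambda
    that maximises the marginal likelihood, and return that lambda together
    with posterior inclusion probabilities."""
    n, p = X.shape
    g = float(n) if g is None else g
    lambdas = np.linspace(0.0, 5.0, 11) if lambdas is None else lambdas

    subsets = [s for d in range(d_max + 1)
               for s in itertools.combinations(range(p), d)]
    loglik = np.array([log_marginal(X, y, list(s), g) for s in subsets])
    score = np.array([prior_score(s, pathway) for s in subsets])

    best = None
    for lam in lambdas:
        logprior = lam * score
        logprior = logprior - logsumexp(logprior)   # normalise the prior
        evidence = logsumexp(loglik + logprior)     # log p(y | lambda), up to a constant
        if best is None or evidence > best[0]:
            best = (evidence, lam, logprior)

    _, lam_hat, logprior = best
    logpost = loglik + logprior
    post = np.exp(logpost - logsumexp(logpost))     # exact posterior over subsets

    incl = np.zeros(p)
    for s, w in zip(subsets, post):
        if s:
            incl[list(s)] += w
    return lam_hat, incl
```

Here `pathway` is assumed to be a sequence giving a pathway label for each column of `X`. Because lambda = 0 is on the grid, the flat prior remains a candidate, which is one simple way the data can down-weight uninformative prior knowledge.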
