Integrating gene set analysis and nonlinear predictive modeling of disease phenotypes using a Bayesian multitask formulation

Background
Molecular signatures of disease phenotypes are typically identified using two mainstream approaches: (i) Predictive modeling methods, such as linear classification and regression algorithms, are used to find signatures predictive of phenotypes from genomic data; these signatures may not be robust because of limited sample sizes or the highly correlated nature of genomic data. (ii) Gene set analysis methods are used to find gene sets on which phenotypes are linearly dependent by bringing prior biological knowledge into the analysis; these methods may not capture more complex nonlinear dependencies. Thus, formulating an integrated model of gene set analysis and nonlinear predictive modeling is of great practical importance.

Results
In this study, we propose a Bayesian binary classification framework to integrate gene set analysis and nonlinear predictive modeling. We then generalize this formulation to a multitask learning setting to model multiple related datasets conjointly. Our main novelty is the probabilistic nonlinear formulation, which enables us to robustly capture nonlinear dependencies between genomic data and phenotype even with small sample sizes. We demonstrate the performance of our algorithms using repeated random subsampling validation experiments on two cancer and two tuberculosis datasets by predicting important disease phenotypes from genome-wide gene expression data.

Conclusions
We obtain predictive performance comparable to or better than that of a baseline Bayesian nonlinear algorithm, and we identify sparse sets of relevant genes and gene sets on all datasets. We also show that our multitask learning formulation further improves generalization performance and helps us better understand the biological processes behind disease phenotypes.
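
The evaluation protocol named above, repeated random subsampling validation, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the 80/20 split, the number of repetitions, and the nearest-centroid classifier are all assumptions chosen for illustration, and the classifier is only a stand-in for the Bayesian model, which is not reproduced here.

```python
import numpy as np

def stratified_split(y, train_fraction, rng):
    """Randomly assign train_fraction of each class to training, the rest to testing."""
    train_idx = []
    for label in np.unique(y):
        members = rng.permutation(np.flatnonzero(y == label))
        n_train = max(1, int(round(train_fraction * len(members))))
        train_idx.extend(members[:n_train])
    train_idx = np.sort(np.array(train_idx))
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    return train_idx, test_idx

def nearest_centroid_predict(X_train, y_train, X_test):
    """Placeholder classifier (assumption): assign each test sample to the closest class mean."""
    labels = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in labels])
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return labels[dists.argmin(axis=1)]

def repeated_subsampling_accuracy(X, y, n_repeats=20, train_fraction=0.8, seed=0):
    """Draw n_repeats independent random train/test splits and return the test accuracy of each."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        train_idx, test_idx = stratified_split(y, train_fraction, rng)
        y_pred = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(float(np.mean(y_pred == y[test_idx])))
    return np.array(scores)
```

Reporting the mean and spread of the returned scores, rather than the result of a single split, gives a less split-dependent estimate of generalization performance, which matters when sample sizes are small.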
