Network-constrained forest for regularized omics data classification

Contemporary molecular biology draws on a wide and heterogeneous set of measurements to model and understand underlying biological processes, including complex diseases. Machine learning is a common approach to building such models. However, models built solely from measured data often suffer from overfitting, as the sample size is typically much smaller than the number of measured features. In this paper, we propose a random forest-based classifier that reduces this overfitting with the aid of prior knowledge in the form of a feature interaction network. We illustrate the proposed method on the task of disease classification based on measured mRNA and miRNA profiles, complemented by an interaction network composed of miRNA-mRNA target relations and mRNA-mRNA interactions corresponding to the interactions between their encoded proteins. We demonstrate that the proposed network-constrained forest employs prior knowledge to increase learning bias and consequently to improve the classification accuracy, stability and comprehensibility of the resulting model. The experiments are carried out in the domain of myelodysplastic syndrome, which is our long-term research focus. We validate our approach on public ovarian carcinoma data of the same form. We believe that the idea of a network-constrained forest can be straightforwardly generalized to arbitrary omics data with an available and non-trivial feature interaction network.
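The abstract does not spell out how the interaction network constrains the forest, so the following is only an illustrative sketch under stated assumptions, not the authors' algorithm. It shows one plausible way to bias feature selection with a network: instead of drawing candidate features uniformly at random for a tree, a short random walk over the interaction graph yields a subset of mutually connected features. The function name `sample_connected_features`, the dictionary-based graph format, and the feature labels (`mirna_A`, `gene_1`, ...) are all hypothetical.

```python
import random


def sample_connected_features(network, n_features, rng):
    """Pick a feature subset by a random walk over an interaction network.

    network: dict mapping a feature name to the set of its neighbours
             (hypothetical toy format; undirected edges are listed both ways).
    Returns a list of distinct features in the order they were first visited.
    """
    current = rng.choice(sorted(network))  # random seed feature
    chosen = [current]
    visited = {current}
    # Bound the walk so that small components cannot cause an endless loop.
    for _ in range(100 * n_features):
        if len(chosen) == n_features:
            break
        neighbours = sorted(network[current])
        if not neighbours:
            break  # isolated feature: nothing more to reach
        current = rng.choice(neighbours)
        if current not in visited:
            visited.add(current)
            chosen.append(current)
    return chosen


# Toy network (assumed): one miRNA with two targets plus mRNA-mRNA edges.
toy_network = {
    "mirna_A": {"gene_1", "gene_2"},
    "gene_1": {"mirna_A", "gene_2"},
    "gene_2": {"mirna_A", "gene_1", "gene_3"},
    "gene_3": {"gene_2", "gene_4"},
    "gene_4": {"gene_3"},
}

rng = random.Random(0)
# One candidate feature subset per tree of a (hypothetical) five-tree forest.
subsets = [sample_connected_features(toy_network, 3, rng) for _ in range(5)]
for subset in subsets:
    print(subset)
```

In such a scheme, each tree would be grown only on its own subset, so that every tree combines features that are linked by known miRNA-mRNA or protein-protein interactions. Whether the seed is drawn uniformly, how long the walk runs, and how the subsets enter the split search are design choices left open here.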
