Network-constrained forest for regularized classification of omics data.

Contemporary molecular biology deals with wide and heterogeneous sets of measurements to model and understand underlying biological processes, including complex diseases. Machine learning is frequently used to build such models. However, models built solely from measured data often suffer from overfitting, as the sample size is typically much smaller than the number of measured features. In this paper, we propose a random forest-based classifier that reduces this overfitting with the aid of prior knowledge in the form of a feature interaction network. We illustrate the proposed method on the task of disease classification based on measured mRNA and miRNA profiles, complemented by an interaction network composed of miRNA-mRNA target relations and mRNA-mRNA interactions that correspond to the interactions between their encoded proteins. We demonstrate that the proposed network-constrained forest employs prior knowledge to increase learning bias and consequently to improve the classification accuracy, stability and comprehensibility of the resulting model. The experiments are carried out in the domain of myelodysplastic syndrome, which is our long-term research interest. We validate our approach on public ovarian carcinoma data of the same form. We believe that the idea of a network-constrained forest can be straightforwardly generalized to arbitrary omics data with an available and non-trivial feature interaction network. The proposed method is publicly available as part of the miXGENE system (http://mixgene.felk.cvut.cz); the workflow that implements the myelodysplastic syndrome experiments is presented as a dedicated case study.
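To make the idea of constraining a forest by a feature interaction network more concrete, the following is a minimal sketch, not the sampling scheme of the paper itself. It assumes that each tree is restricted to a small set of features that are connected in the network, grown outward from a randomly chosen seed feature. The function name `sample_network_features`, the toy network and all feature identifiers are hypothetical and serve illustration only.

```python
import random


def sample_network_features(adjacency, subset_size, rng):
    """Grow a connected feature subset from a random seed feature.

    adjacency maps each feature to a list of its network neighbours.
    This is an illustrative assumption about how a tree's candidate
    features could be constrained, not the authors' exact procedure.
    """
    seed = rng.choice(sorted(adjacency))
    selected = {seed}
    # Frontier: neighbours of the selected features that are not yet selected.
    frontier = set(adjacency[seed]) - selected
    while frontier and len(selected) < subset_size:
        nxt = rng.choice(sorted(frontier))
        selected.add(nxt)
        frontier |= set(adjacency.get(nxt, ()))
        frontier -= selected
    return selected


# Toy interaction network with two disconnected components
# (hypothetical miRNA and gene identifiers).
network = {
    "miR-a": ["gene1", "gene2"],
    "gene1": ["miR-a", "gene3"],
    "gene2": ["miR-a"],
    "gene3": ["gene1"],
    "gene4": ["gene5"],
    "gene5": ["gene4"],
}

rng = random.Random(0)
# One candidate feature subset per tree of a hypothetical five-tree forest.
subsets = [sample_network_features(network, 3, rng) for _ in range(5)]
```

In such a scheme every tree would only see features that interact with one another, which narrows the hypothesis space (a stronger learning bias) compared with sampling features uniformly at random.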
