Network inference with ensembles of bi-clustering trees

BackgroundNetwork inference is crucial for biomedicine and systems biology. Biological entities and their associations are often modeled as interaction networks. Examples include drug protein interaction or gene regulatory networks. Studying and elucidating such networks can lead to the comprehension of complex biological processes. However, usually we have only partial knowledge of those networks and the experimental identification of all the existing associations between biological entities is very time consuming and particularly expensive. Many computational approaches have been proposed over the years for network inference, nonetheless, efficiency and accuracy are still persisting open problems. Here, we propose bi-clustering tree ensembles as a new machine learning method for network inference, extending the traditional tree-ensemble models to the global network setting. The proposed approach addresses the network inference problem as a multi-label classification task. More specifically, the nodes of a network (e.g., drugs or proteins in a drug-protein interaction network) are modelled as samples described by features (e.g., chemical structure similarities or protein sequence similarities). The labels in our setting represent the presence or absence of links connecting the nodes of the interaction network (e.g., drug-protein interactions in a drug-protein interaction network).ResultsWe extended traditional tree-ensemble methods, such as extremely randomized trees (ERT) and random forests (RF) to ensembles of bi-clustering trees, integrating background information from both node sets of a heterogeneous network into the same learning framework. We performed an empirical evaluation, comparing the proposed approach to currently used tree-ensemble based approaches as well as other approaches from the literature. We demonstrated the effectiveness of our approach in different interaction prediction (network inference) settings. For evaluation purposes, we used several benchmark datasets that represent drug-protein and gene regulatory networks. We also applied our proposed method to two versions of a chemical-protein association network extracted from the STITCH database, demonstrating the potential of our model in predicting non-reported interactions.ConclusionsBi-clustering trees outperform existing tree-based strategies as well as machine learning methods based on other algorithms. Since our approach is based on tree-ensembles it inherits the advantages of tree-ensemble learning, such as handling of missing values, scalability and interpretability.

[1]  J. Collins,et al.  Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles , 2007, PLoS biology.

[2]  Grigorios Tsoumakas,et al.  Mining Multi-label Data , 2010, Data Mining and Knowledge Discovery Handbook.

[3]  David S. Wishart,et al.  DrugBank 5.0: a major update to the DrugBank database for 2018 , 2017, Nucleic Acids Res..

[4]  T. Ashburn,et al.  Drug repositioning: identifying and developing new uses for existing drugs , 2004, Nature Reviews Drug Discovery.

[5]  Yun Xie,et al.  Identification of drug-target interaction from interactome network with 'guilt-by-association' principle and topology features , 2016, Bioinform..

[6]  John P. Overington,et al.  ChEMBL: a large-scale bioactivity database for drug discovery , 2011, Nucleic Acids Res..

[7]  Ian H. Witten,et al.  Data Mining: Practical Machine Learning Tools and Techniques, 3/E , 2014 .

[8]  Cathy H. Wu,et al.  UniProt: the Universal Protein knowledgebase , 2004, Nucleic Acids Res..

[9]  Damian Szklarczyk,et al.  The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible , 2016, Nucleic Acids Res..

[10]  Ting Wang,et al.  An improved map of conserved regulatory sites for Saccharomyces cerevisiae , 2006, BMC Bioinformatics.

[11]  Sorin Draghici,et al.  Machine Learning and Its Applications to Biology , 2007, PLoS Comput. Biol..

[12]  Chee Keong Kwoh,et al.  Drug-target interaction prediction via class imbalance-aware ensemble learning , 2016, BMC Bioinformatics.

[13]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[14]  Mark Goadrich,et al.  The relationship between Precision-Recall and ROC curves , 2006, ICML.

[15]  Piet Demeester,et al.  Netter: re-ranking gene network inference predictions using structural network properties , 2016, BMC Bioinformatics.

[16]  Zhi-Hua Zhou,et al.  ML-KNN: A lazy learning approach to multi-label learning , 2007, Pattern Recognit..

[17]  Feng Liu,et al.  Predicting drug side effects by multi-label learning and ensemble learning , 2015, BMC Bioinformatics.

[18]  Pierre Geurts,et al.  Supervised learning with decision tree-based methods in computational and systems biology. , 2009, Molecular bioSystems.

[19]  Min-Ling Zhang,et al.  A Review on Multi-Label Learning Algorithms , 2014, IEEE Transactions on Knowledge and Data Engineering.

[20]  Pierre Geurts,et al.  Classifying pairs with trees for supervised biological network inference† †Electronic supplementary information (ESI) available: Implementation and computational issues, supplementary performance curves, and illustration of interpretability of trees. See DOI: 10.1039/c5mb00174a Click here for additi , 2014, Molecular bioSystems.

[21]  Wei-Yin Loh,et al.  Classification and regression trees , 2011, WIREs Data Mining Knowl. Discov..

[22]  Chris Morley,et al.  Open Babel: An open chemical toolbox , 2011, J. Cheminformatics.

[23]  Vladimir B. Bajic,et al.  DDR: efficient computational method to predict drug–target interactions using graph mining and machine learning approaches , 2017, Bioinform..

[24]  Keqin Li,et al.  Predicting Drug–Target Interactions With Multi-Information Fusion , 2017, IEEE Journal of Biomedical and Health Informatics.

[25]  Diogo M. Camacho,et al.  Wisdom of crowds for robust gene network inference , 2012, Nature Methods.

[26]  Saso Dzeroski,et al.  Predicting gene function using hierarchical multi-label decision tree ensembles , 2010, BMC Bioinformatics.

[27]  Hendrik Blockeel,et al.  Seeing the Forest Through the Trees: Learning a Comprehensible Model from an Ensemble , 2007, ECML.

[28]  Hilde van der Togt,et al.  Publisher's Note , 2003, J. Netw. Comput. Appl..

[29]  Hiroyuki Ogata,et al.  KEGG: Kyoto Encyclopedia of Genes and Genomes , 1999, Nucleic Acids Res..

[30]  Shi-Hua Zhang,et al.  DrugE-Rank: improving drug–target interaction prediction of new candidate drugs or targets by ensemble learning to rank , 2016, Bioinform..

[31]  Chee-Keong Kwoh,et al.  Computational prediction of drug-target interactions using chemogenomic approaches: an empirical survey , 2019, Briefings Bioinform..

[32]  Pierre Geurts,et al.  Extremely randomized trees , 2006, Machine Learning.

[33]  Mark A. Ragan,et al.  Supervised, semi-supervised and unsupervised inference of gene regulatory networks , 2013, Briefings Bioinform..

[34]  Christian von Mering,et al.  STITCH: interaction networks of chemicals and proteins , 2007, Nucleic Acids Res..

[35]  Arlindo L. Oliveira,et al.  Biclustering algorithms for biological data analysis: a survey , 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[36]  Chee Keong Kwoh,et al.  Drug-target interaction prediction by learning from local information and neighbors , 2013, Bioinform..

[37]  Pierre Geurts,et al.  Global multi-output decision trees for interaction prediction , 2018, Machine Learning.

[38]  Michael J. Keiser,et al.  Large Scale Prediction and Testing of Drug Activity on Side-Effect Targets , 2012, Nature.

[39]  Nello Cristianini,et al.  Learning the Kernel Matrix with Semidefinite Programming , 2002, J. Mach. Learn. Res..

[40]  Yong-Yeol Ahn,et al.  Optimizing drug–target interaction prediction based on random walk on heterogeneous networks , 2015, Journal of Cheminformatics.

[41]  William Stafford Noble,et al.  A new pairwise kernel for biological network inference with support vector machines , 2007, BMC Bioinformatics.

[42]  Chunyan Miao,et al.  Neighborhood Regularized Logistic Matrix Factorization for Drug-Target Interaction Prediction , 2016, PLoS Comput. Biol..

[43]  Anna Korhonen,et al.  Link prediction in drug-target interactions network using similarity indices , 2017, BMC Bioinformatics.

[44]  Jean-Philippe Vert,et al.  Supervised reconstruction of biological networks with local models , 2007, ISMB/ECCB.

[45]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[46]  Celine Vens,et al.  Predicting gene function in S. cerevisiae and A. thaliana using hierarchical multi-label decision tree ensembles , 2008 .

[47]  Elena Marchiori,et al.  Gaussian interaction profile kernels for predicting drug-target interaction , 2011, Bioinform..

[48]  Saso Dzeroski,et al.  Tree ensembles for predicting structured outputs , 2013, Pattern Recognit..

[49]  Ivan G. Costa,et al.  A multiple kernel learning algorithm for drug-target interaction prediction , 2016, BMC Bioinformatics.

[50]  Takaya Saito,et al.  The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets , 2015, PloS one.

[51]  Pierre Geurts,et al.  On protocols and measures for the validation of supervised methods for the inference of biological networks , 2013, Front. Genet..

[52]  Kiyoshi Asai,et al.  The em Algorithm for Kernel Matrix Completion with Auxiliary Data , 2003, J. Mach. Learn. Res..

[53]  Yoshihiro Yamanishi,et al.  Prediction of drug–target interaction networks from the integration of chemical and genomic spaces , 2008, ISMB.

[54]  Jennifer Venhorst,et al.  Target-drug interactions: first principles and their application to drug discovery. , 2012, Drug discovery today.

[55]  David Craft,et al.  The value of prior knowledge in machine learning of complex network systems , 2016, bioRxiv.

[56]  Michelangelo Ceci,et al.  Semi-Supervised Multi-View Learning for Gene Network Reconstruction , 2015, SEBD.

[57]  Sampo Pyysalo,et al.  Neural networks for link prediction in realistic biomedical graphs: a multi-dimensional evaluation of graph embedding-based approaches , 2018, BMC Bioinformatics.

[58]  Eyke Hüllermeier,et al.  Multi-target prediction: a unifying view on problems and methods , 2018, Data Mining and Knowledge Discovery.

[59]  อนิรุธ สืบสิงห์,et al.  Data Mining Practical Machine Learning Tools and Techniques , 2014 .

[60]  Jean-Philippe Vert,et al.  Reconstruction of Biological Networks by Supervised Machine Learning Approaches , 2008 .

[61]  Jian-Yu Shi,et al.  Predicting drug-target interaction for new drugs using enhanced similarity measures and super-target clustering. , 2015, Methods.