Feature Induction and Network Mining with Clustering Tree Ensembles

The volume of data generated and collected with modern technologies is growing rapidly. Such data often have a complex structure, and mining and analyzing them efficiently is crucial to the performance of many machine learning tasks. This paper proposes a novel data mining framework for unsupervised learning, based on decision tree learning and tree ensembles. The approach induces an informative feature representation and can handle diverse and complex data. Building on it, a new scheme is proposed for mining interaction data, which are often modeled as homogeneous or heterogeneous networks and arise in fields such as social media, recommender systems, and bioinformatics. Learning is unsupervised and follows the inductive setup, so the learned model can also be applied to instances not seen during training. An experimental evaluation confirms the effectiveness of the proposed approach.
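To make the idea of tree-based feature induction concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes a simplified stand-in technique: completely random trees grown on unlabeled data, with each instance encoded by the leaves it reaches. All function names (`build_tree`, `fit_forest`, `transform`) and the toy data are hypothetical and chosen only for illustration.

```python
import random


def make_leaf(counter):
    """Create a leaf with the next free id for the current tree."""
    leaf = ("leaf", counter[0])
    counter[0] += 1
    return leaf


def build_tree(rows, depth, rng, counter):
    """Grow one completely random tree; no labels are used (assumption)."""
    if depth == 0 or len(rows) < 2:
        return make_leaf(counter)
    feature = rng.randrange(len(rows[0]))
    values = [row[feature] for row in rows]
    lo, hi = min(values), max(values)
    if lo == hi:  # constant feature: nothing to split on
        return make_leaf(counter)
    threshold = rng.uniform(lo, hi)
    left = [row for row in rows if row[feature] <= threshold]
    right = [row for row in rows if row[feature] > threshold]
    if not left or not right:  # degenerate split: stop here
        return make_leaf(counter)
    return ("split", feature, threshold,
            build_tree(left, depth - 1, rng, counter),
            build_tree(right, depth - 1, rng, counter))


def fit_forest(rows, n_trees=5, max_depth=3, seed=0):
    """Return a list of (tree, number_of_leaves) pairs."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        counter = [0]
        tree = build_tree(rows, max_depth, rng, counter)
        forest.append((tree, counter[0]))
    return forest


def leaf_index(tree, x):
    """Follow the splits from the root down to the leaf that x reaches."""
    while tree[0] == "split":
        _, feature, threshold, left, right = tree
        tree = left if x[feature] <= threshold else right
    return tree[1]


def transform(forest, x):
    """Encode x as concatenated one-hot leaf indicators, one block per tree."""
    code = []
    for tree, n_leaves in forest:
        one_hot = [0] * n_leaves
        one_hot[leaf_index(tree, x)] = 1
        code.extend(one_hot)
    return code


# Toy unlabeled data (hypothetical values).
data = [[0.1, 0.2], [0.15, 0.22], [0.9, 0.8], [0.85, 0.95], [0.5, 0.1]]
forest = fit_forest(data, n_trees=4, max_depth=3, seed=42)
train_codes = [transform(forest, x) for x in data]

# Inductive use: an instance not seen during fitting is encoded the same way.
unseen_code = transform(forest, [0.12, 0.21])
```

Because `transform` only needs the fitted trees, the same encoding applies to training instances and to new ones, which is what the inductive setup in the abstract refers to. The paper's own approach uses clustering trees, so this sketch should be read only as an illustration of the general leaf-encoding idea.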
