Feature Induction based on Extremely Randomized Tree Paths

The volume of data generated and collected with modern technologies grows exponentially. These data often exhibit complex structure, which significantly affects the performance of many machine learning tasks. Despite considerable effort, efficiently mining and analyzing such data remains an open problem. Here, a novel data mining framework for unsupervised learning tasks is proposed, based on decision tree learning and ensembles of trees. The proposed approach induces an informative feature representation and can handle diverse data types (e.g., numerical, categorical) and complex structures (e.g., graphs, networks, data with missing values). Learning is performed in an unsupervised manner and follows the inductive setting, so previously unseen samples can be mapped into the induced feature space. The experimental evaluation confirms the effectiveness of the proposed approach.
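As a rough illustration of the core idea, the sketch below uses scikit-learn's `RandomTreesEmbedding`, which builds an ensemble of totally random trees and encodes each sample by the leaves its path reaches (a sparse one-hot representation). This is a stand-in for the general tree-path feature induction principle, not the paper's exact algorithm; the toy data and all parameter choices are assumptions.

```python
# Hedged sketch: unsupervised feature induction from randomized tree paths.
# Each sample is represented by indicator features for the leaves it falls
# into across an ensemble of randomized trees. Illustrative only; not the
# paper's specific method.
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))  # toy unlabeled numerical data (assumption)

# Fit an ensemble of totally random trees; no labels are used.
embedder = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
Z = embedder.fit_transform(X)  # sparse binary matrix: one column per leaf

# Inductive setting: new samples are routed through the same fitted trees,
# landing in the same induced feature space.
X_new = rng.normal(size=(5, 4))
Z_new = embedder.transform(X_new)
```

The induced binary features can then feed any downstream unsupervised task (e.g., clustering or kernel construction via leaf co-occurrence).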
