Semi-supervised classification trees

In many real-life problems, obtaining labeled data is expensive and laborious, while unlabeled data is often abundant. The scarcity of labeled data can seriously limit the performance of supervised learning methods. Here, we propose a semi-supervised classification tree induction algorithm that exploits both labeled and unlabeled data while preserving the appealing characteristics of standard supervised decision trees: they are non-parametric and efficient, have good predictive performance, and produce readily interpretable models. We further improve predictive performance by using these trees as base models in random forests. We performed an extensive empirical evaluation on 12 binary and 12 multi-class classification datasets. The results show that the proposed methods improve on the predictive performance of their supervised counterparts. In addition, when labeled data is scarce, the semi-supervised decision trees often yield models that are smaller and easier to interpret than supervised decision trees.
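
The abstract does not spell out how a tree split can use unlabeled examples, so the following is only an illustrative sketch and not necessarily the criterion used in the paper. It assumes one common way to do this: score a candidate split by a weighted sum of the Gini impurity of the labeled examples and the mean variance of the features over all examples. The function names, the weight `w`, and the use of `None` to mark unlabeled examples are all hypothetical choices made for this example.

```python
from statistics import pvariance


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def mean_feature_variance(rows):
    """Average per-feature population variance of a list of feature vectors."""
    if len(rows) < 2:
        return 0.0
    columns = list(zip(*rows))
    return sum(pvariance(col) for col in columns) / len(columns)


def ssl_impurity(rows, labels, w=0.5):
    """Hypothetical semi-supervised impurity (an assumption, not the paper's formula).

    labels[i] is None for an unlabeled example.
    w = 1.0 uses only the class labels; w = 0.0 uses only the features.
    """
    labeled = [y for y in labels if y is not None]
    return w * gini(labeled) + (1.0 - w) * mean_feature_variance(rows)


def best_threshold(rows, labels, feature, w=0.5):
    """Return (threshold, score) minimizing the size-weighted child impurity."""
    values = sorted({r[feature] for r in rows})
    n = len(rows)
    best = (None, float("inf"))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0  # candidate threshold between adjacent values
        left = [i for i, r in enumerate(rows) if r[feature] <= t]
        right = [i for i, r in enumerate(rows) if r[feature] > t]
        score = sum(
            len(idx) / n
            * ssl_impurity([rows[i] for i in idx], [labels[i] for i in idx], w)
            for idx in (left, right)
        )
        if score < best[1]:
            best = (t, score)
    return best


# Two labeled examples at the extremes, four unlabeled examples in between.
rows = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
labels = ["a", None, None, None, None, "b"]
threshold, score = best_threshold(rows, labels, feature=0, w=0.5)
```

In this toy data every candidate threshold separates the two labeled examples perfectly, so the labels alone cannot rank the splits; the variance term is what favors the threshold in the gap between the two groups of points. In practice the features would likely need to be normalized so that the variance term is comparable in scale to the Gini term.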
