An Extensive Evaluation of Decision Tree–Based Hierarchical Multilabel Classification Methods and Performance Measures

Hierarchical multilabel classification is a complex classification problem in which an instance can be assigned to more than one class simultaneously and the classes are organized hierarchically into superclasses and subclasses. In other words, an instance can belong to more than one path in the hierarchy. This article experimentally analyses the behavior of different decision tree–based hierarchical multilabel classification methods that follow the local and global classification approaches. The approaches are compared using distinct hierarchy-based and distance-based evaluation measures on real datasets with varied multilabel and hierarchical characteristics. The evaluation measures themselves are also compared according to their degrees of consistency, discriminancy, and indifferency. Based on the experimental analysis, we recommend the global classification approach and suggest the Hierarchical Precision and Hierarchical Recall evaluation measures.
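To make the recommended measures concrete, the following is a minimal sketch of Hierarchical Precision and Hierarchical Recall. It assumes the commonly used set-based definition, in which each true and predicted label set is first extended with all ancestor classes and the overlaps are then summed over instances. The abstract does not spell out the formulas, so this definition, the `parents` mapping, and the function names `with_ancestors` and `hierarchical_pr` are illustrative assumptions rather than the authors' implementation.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def with_ancestors(labels: Iterable[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Return the given labels together with all of their ancestors.

    `parents` maps a class to its direct parent classes (hypothetical format);
    classes missing from the mapping are treated as top-level classes.
    """
    closed: Set[str] = set()
    stack = list(labels)
    while stack:
        node = stack.pop()
        if node not in closed:
            closed.add(node)
            stack.extend(parents.get(node, []))
    return closed


def hierarchical_pr(
    true_sets: Sequence[Iterable[str]],
    pred_sets: Sequence[Iterable[str]],
    parents: Dict[str, List[str]],
) -> Tuple[float, float]:
    """Compute (hierarchical precision, hierarchical recall) over all instances."""
    overlap = n_pred = n_true = 0
    for true_labels, pred_labels in zip(true_sets, pred_sets):
        t_ext = with_ancestors(true_labels, parents)
        p_ext = with_ancestors(pred_labels, parents)
        overlap += len(t_ext & p_ext)
        n_pred += len(p_ext)
        n_true += len(t_ext)
    hp = overlap / n_pred if n_pred else 0.0
    hr = overlap / n_true if n_true else 0.0
    return hp, hr


# Toy hierarchy (assumed for illustration): 1 -> 1.1 -> 1.1.1, 1 -> 1.2, 2 -> 2.1
parents = {"1.1": ["1"], "1.2": ["1"], "1.1.1": ["1.1"], "2.1": ["2"]}

# One instance whose true classes lie on two paths; the prediction stops
# one level short on both paths.
hp, hr = hierarchical_pr([{"1.1.1", "2.1"}], [{"1.1", "2"}], parents)
print(hp, hr)
```

In this toy case every predicted class is a correct ancestor, so precision is perfect, while recall is penalized for the two deeper classes that were missed. This kind of partial credit along a path is what separates hierarchy-based measures from flat ones.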
