Inducing Hierarchical Multi-label Classification rules with Genetic Algorithms

Abstract: Hierarchical Multi-Label Classification (HMC) is a challenging classification task in which the classes are hierarchically structured, with superclass and subclass relationships. It is common, for instance, in Protein Function Prediction, where a protein can perform multiple functions simultaneously. In these tasks it is very difficult to achieve high predictive performance, since hundreds or even thousands of classes with imbalanced data distributions have to be considered. In addition, the models should ideally be easily interpretable, to allow validation of the knowledge extracted from the data. This work proposes and investigates the use of Genetic Algorithms to induce rules that are both hierarchical and multi-label. Several experiments with different fitness functions and genetic operators are performed to obtain different HMC rules. The proposed Genetic Algorithm configurations are evaluated together with state-of-the-art methods for HMC rule induction based on Ant Colony Optimization and Predictive Clustering Trees, using many datasets related to the Protein Function Prediction task. The experimental results show that it is possible to recommend the best configuration in terms of predictive performance and model interpretability.
