A multi-label approach using binary relevance and decision trees applied to functional genomics

Many classification problems, especially in the field of bioinformatics, are associated with more than one class, known as multi-label classification problems. In this study, we propose a new adaptation for the Binary Relevance algorithm taking into account possible relations among labels, focusing on the interpretability of the model, not only on its performance. Experiments were conducted to compare the performance of our approach against others commonly found in the literature and applied to functional genomic datasets. The experimental results show that our proposal has a performance comparable to that of other methods and that, at the same time, it provides an interpretable model from the multi-label problem.

[1]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[2]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[3]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[4]  D. Botstein,et al.  The transcriptional program of sporulation in budding yeast. , 1998, Science.

[5]  Kei-Hoi Cheung,et al.  TRIPLES: a database of gene function in Saccharomyces cerevisiae , 2000, Nucleic Acids Res..

[6]  Taeho Jo,et al.  A Multiple Resampling Method for Learning from Imbalanced Data Sets , 2004, Comput. Intell..

[7]  Amanda Clare,et al.  Knowledge Discovery in Multi-label Phenotype Data , 2001, PKDD.

[8]  Luc De Raedt,et al.  Top-Down Induction of Clustering Trees , 1998, ICML.

[9]  G. Church,et al.  Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation , 1998, Nature Biotechnology.

[10]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[11]  Levent Özgür,et al.  Text Categorization with Class-Based and Corpus-Based Keyword Selection , 2005, ISCIS.

[12]  Sholom M. Weiss,et al.  Predictive data mining - a practical guide , 1997 .

[13]  Grigorios Tsoumakas,et al.  Random k -Labelsets: An Ensemble Method for Multilabel Classification , 2007, ECML.

[14]  Zhi-Hua Zhou,et al.  ML-KNN: A lazy learning approach to multi-label learning , 2007, Pattern Recognit..

[15]  Ester Bernadó-Mansilla,et al.  The class imbalance problem in learning classifier systems: a preliminary study , 2005, GECCO '05.

[16]  Michael Ruogu Zhang,et al.  Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. , 1998, Molecular biology of the cell.

[17]  Jorma Laurikkala,et al.  Improving Identification of Difficult Small Classes by Balancing Class Distribution , 2001, AIME.

[18]  Wei-Yin Loh,et al.  Classification and regression trees , 2011, WIREs Data Mining Knowl. Discov..

[19]  R D Appel,et al.  Protein identification and analysis tools in the ExPASy server. , 1999, Methods in molecular biology.

[20]  Saso Dzeroski,et al.  Predicting gene function using hierarchical multi-label decision tree ensembles , 2010, BMC Bioinformatics.

[21]  Yoram Singer,et al.  BoosTexter: A Boosting-based System for Text Categorization , 2000, Machine Learning.

[22]  H. Mewes,et al.  The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. , 2004, Nucleic acids research.

[23]  Michelangelo Ceci,et al.  Using PPI network autocorrelation in hierarchical multi-label classification trees for gene function prediction , 2013, BMC Bioinformatics.

[24]  Alex Alves Freitas,et al.  Multi-label Hierarchical Classification of Protein Functions with Artificial Immune Systems , 2008, BSB.

[25]  Dmitrij Frishman,et al.  MIPS: analysis and annotation of proteins from whole genomes in 2005 , 2005, Nucleic Acids Res..

[26]  M. Friedman A Comparison of Alternative Tests of Significance for the Problem of $m$ Rankings , 1940 .

[27]  Grigorios Tsoumakas,et al.  MULAN: A Java Library for Multi-Label Learning , 2011, J. Mach. Learn. Res..

[28]  Sun-Yuan Kung,et al.  R3P-Loc: a compact multi-label predictor using ridge regression and random projection for protein subcellular localization. , 2014, Journal of theoretical biology.

[29]  D. Botstein,et al.  Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. , 2001, Molecular biology of the cell.

[30]  Josef Kittler,et al.  A Multiple Expert Approach to the Class Imbalance Problem Using Inverse Random under Sampling , 2009, MCS.

[31]  Einoshin Suzuki,et al.  Bloomy Decision Tree for Multi-objective Classification , 2001, PKDD.

[32]  Pericles A. Mitkas,et al.  Multi Level Clustering of Phylogenetic Profiles , 2010, 2010 IEEE International Conference on BioInformatics and BioEngineering.

[33]  Ian H. Witten,et al.  Data mining - practical machine learning tools and techniques, Second Edition , 2005, The Morgan Kaufmann series in data management systems.

[34]  Ian Witten,et al.  Data Mining , 2000 .

[35]  Grigorios Tsoumakas,et al.  Mining Multi-label Data , 2010, Data Mining and Knowledge Discovery Handbook.

[36]  Peter Clark,et al.  The CN2 Induction Algorithm , 1989, Machine Learning.

[37]  D. Botstein,et al.  Genomic expression programs in the response of yeast cells to environmental changes. , 2000, Molecular biology of the cell.

[38]  S. Oliver A network approach to the systematic analysis of yeast gene function. , 1996, Trends in genetics : TIG.

[39]  Pericles A. Mitkas,et al.  Multi-genome Core Pathway Identification through Gene Clustering , 2012, AIAI.