Evaluation of Distance Measures for Hierarchical Multi-Label Classification in Functional Genomics

Hierarchical multi-label classification (HMLC) is a variant of classification in which instances may belong to multiple classes that are organized in a hierarchy. The approach we use is based on decision trees and is set in the predictive clustering trees (PCTs) framework, which is implemented in the CLUS system. In this work, we investigate how different distance measures for hierarchies influence the predictive performance of PCTs. The distance measures that we consider are weighted Euclidean distance, Jaccard distance, SimGIC, and ImageCLEF distance. We use datasets from the area of functional genomics to evaluate the performance of PCTs with the different distances. The datasets describe different functions of the genes in the genomes of two well-studied organisms: S. cerevisiae and A. thaliana. We use precision-recall curves to evaluate predictive performance. The results of the Friedman test suggest that the differences in performance among the distance measures are not statistically significant.
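
To make two of the distance measures named above concrete, the following is a minimal sketch and not the CLUS implementation. It assumes a hypothetical toy hierarchy (`PARENT`), assumes that label sets are closed under the hierarchy constraint (a class implies all of its ancestors), and assumes the class weighting w(c) = w0^depth(c), with top-level classes at depth 0 and w0 = 0.75. The function names and the depth convention are illustrative choices that the abstract does not specify.

```python
import math

# Hypothetical toy hierarchy: each class maps to its parent (None = top level).
PARENT = {"1": None, "1/1": "1", "1/2": "1", "2": None, "2/1": "2"}


def with_ancestors(labels):
    """Close a label set under the hierarchy: a class implies its ancestors."""
    closed = set()
    for c in labels:
        while c is not None:
            closed.add(c)
            c = PARENT[c]
    return closed


def depth(c):
    """Number of edges from class c up to its top-level ancestor (assumed convention)."""
    d = 0
    while PARENT[c] is not None:
        c = PARENT[c]
        d += 1
    return d


def weighted_euclidean(a, b, w0=0.75):
    """Weighted Euclidean distance between the 0/1 class vectors of a and b.

    Only classes on which the two vectors disagree contribute, each with
    the assumed weight w0 ** depth(c), so deeper disagreements cost less.
    """
    a, b = with_ancestors(a), with_ancestors(b)
    return math.sqrt(sum(w0 ** depth(c) for c in a ^ b))


def jaccard_distance(a, b):
    """One minus the ratio of shared classes to all classes in either set."""
    a, b = with_ancestors(a), with_ancestors(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


if __name__ == "__main__":
    siblings = ({"1/1"}, {"1/2"})   # share the parent class "1"
    unrelated = ({"1/1"}, {"2/1"})  # share no class at all
    for x, y in (siblings, unrelated):
        print(x, y, weighted_euclidean(x, y), jaccard_distance(x, y))
```

Under these assumptions, two instances annotated with sibling classes end up closer than two instances annotated in different top-level branches under both measures, which is the behaviour a hierarchy-aware distance is meant to capture.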

[1]  Saso Dzeroski,et al.  Hierarchical annotation of medical images , 2011, Pattern Recognit..

[2]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[3]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[4]  Juho Rousu,et al.  Learning hierarchical multi-category text classification models , 2005, ICML.

[5]  Robert E. Schapire,et al.  Hierarchical multi-label prediction of gene function , 2006, Bioinform..

[6]  Saso Dzeroski,et al.  Ensembles of Multi-Objective Decision Trees , 2007, ECML.

[7]  Gökhan BakIr,et al.  Predicting Structured Data , 2008 .

[8]  Saso Dzeroski,et al.  Constraint Based Induction of Multi-objective Regression Trees , 2005, KDID.

[9]  Catia Pesquita,et al.  Evaluating GO-based Semantic Similarity Measures , 2007 .

[10]  Alex A. Freitas,et al.  A review of performance evaluation measures for hierarchical classifiers , 2007 .

[11]  Bernard Ženko,et al.  Learning Predictive Clustering Rules , 2005, Informatica.

[12]  Saso Dzeroski,et al.  Decision trees for hierarchical multi-label classification , 2008, Machine Learning.

[13]  Amanda Clare,et al.  Knowledge Discovery in Multi-label Phenotype Data , 2001, PKDD.

[14]  Dmitrij Frishman,et al.  MIPS: a database for protein sequences and complete genomes , 1998, Nucleic Acids Res..

[15]  Luc De Raedt,et al.  Top-Down Induction of Clustering Trees , 1998, ICML.

[16]  Hai Hu,et al.  Assessing semantic similarity measures for the characterization of human regulatory pathways , 2006, Bioinform..