Dealing with Imbalanceness in Hierarchical Multi-Label Datasets Using Multi-Label Resampling Techniques

The task of learning from imbalanced datasets has been widely investigated in the binary, multi-class and multi-label scenarios. Although this problem also affects hierarchical datasets, to the best of our knowledge, no works in the literature deal with imbalance in hierarchical contexts. In this paper we propose metrics to measure how imbalanced a Hierarchical Multi-Label Dataset is, along with an approach to deal with this imbalance using multi-label resampling techniques. The proposed technique converts the dataset labels to a strictly multi-label format, applies well-known multi-label resampling techniques, and then converts the dataset back to its hierarchical taxonomy. The experimental evaluation on a highly imbalanced Music Genre Recognition dataset achieved promising results, with an increase of 0.2337 in the Avg-AUROC metric relative to the original dataset.
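
The three-step pipeline described above (flatten the hierarchy, resample, restore the hierarchy) might be sketched as follows. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the path encoding `"Rock/Punk"`, the function names `flatten`, `unflatten` and `random_oversample`, and the use of plain random oversampling toward the mean label count are all hypothetical choices made here for illustration.

```python
import random
from collections import Counter


def flatten(paths):
    """Expand hierarchical paths (assumed format 'Rock/Punk') into a flat
    label set that contains every ancestor as its own label."""
    labels = set()
    for path in paths:
        parts = path.split("/")
        for depth in range(1, len(parts) + 1):
            labels.add("/".join(parts[:depth]))
    return labels


def unflatten(labels):
    """Recover the hierarchical annotation by keeping only the deepest
    labels; ancestors are implied by the taxonomy."""
    return {l for l in labels if not any(o.startswith(l + "/") for o in labels)}


def random_oversample(dataset, seed=0):
    """Clone random instances of each under-represented label until that
    label reaches the mean label count (a simplified stand-in for the
    multi-label resampling step; co-occurring labels are not re-counted)."""
    rng = random.Random(seed)
    counts = Counter(l for _, labels in dataset for l in labels)
    target = sum(counts.values()) / len(counts)
    resampled = list(dataset)
    for label, count in counts.items():
        if count >= target:
            continue
        pool = [inst for inst in dataset if label in inst[1]]
        while count < target:
            resampled.append(rng.choice(pool))
            count += 1
    return resampled


# Hypothetical toy dataset: (track id, hierarchical genre paths).
tracks = [("t1", {"Rock"}), ("t2", {"Rock"}), ("t3", {"Rock/Punk"}), ("t4", {"Jazz"})]
flat = [(tid, flatten(paths)) for tid, paths in tracks]
balanced = random_oversample(flat)
restored = [(tid, unflatten(labels)) for tid, labels in balanced]
```

One caveat of this sketch: `unflatten` cannot distinguish an instance annotated with both an internal node and one of its descendants from an instance annotated only with the descendant, so the round trip is exact only when annotations are given as deepest paths.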
