Dealing with class imbalance in classifier chains via random undersampling

Abstract: Class imbalance is an intrinsic characteristic of multi-label data. Most labels in multi-label datasets are associated with only a small number of training examples, far fewer than the size of the dataset. Class imbalance therefore poses a key challenge for most multi-label learning methods. Ensemble of Classifier Chains (ECC), one of the most prominent multi-label learning methods, is no exception, as each of the binary models it builds is trained on all positive and negative examples of a label. To make ECC resilient to class imbalance, we first couple it with random undersampling. We then present two extensions of this basic approach, in which we build a varying number of binary models per label and construct chains of different sizes, in order to better exploit majority examples with approximately the same computational budget. Experimental results on 16 multi-label datasets demonstrate the effectiveness of the proposed approaches across a variety of evaluation metrics.
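
The basic idea of coupling a classifier chain with random undersampling can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `ratio` parameter, and the toy data are assumptions made for illustration, and no binary model is actually trained. The sketch only shows how, for one chain, each label could receive its own undersampled training set whose features are extended with the true values of the labels that precede it in the chain.

```python
import random


def undersample_indices(y, rng, ratio=1.0):
    """Keep every minority example and a random subset of majority examples.

    y     : list of 0/1 values for one label
    ratio : majority examples kept per minority example (assumed parameter)
    """
    pos = [i for i, v in enumerate(y) if v == 1]
    neg = [i for i, v in enumerate(y) if v == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    kept = rng.sample(majority, n_keep)  # sampling without replacement
    return sorted(minority + kept)


def chain_training_sets(X, Y, rng, ratio=1.0):
    """Build the per-label training sets of a single classifier chain.

    The chain order is drawn at random. The training set of each label is
    undersampled independently, and its features are the original features
    plus the true values of all labels earlier in the chain.
    """
    order = list(range(len(Y[0])))
    rng.shuffle(order)
    sets = {}
    for position, label in enumerate(order):
        y = [row[label] for row in Y]
        idx = undersample_indices(y, rng, ratio)
        previous = order[:position]
        feats = [X[i] + [Y[i][p] for p in previous] for i in idx]
        targets = [y[i] for i in idx]
        sets[label] = (feats, targets)
    return order, sets


# Toy data (hypothetical): 10 examples, 1 feature, 3 imbalanced labels.
X = [[float(i)] for i in range(10)]
Y = [[1 if i < 2 else 0, 1 if i % 5 == 0 else 0, 1 if i == 9 else 0]
     for i in range(10)]

rng = random.Random(0)
order, sets = chain_training_sets(X, Y, rng)
for label in order:
    feats, targets = sets[label]
    print(label, len(targets), sum(targets))
```

An ensemble would repeat this procedure with a different random chain order and a different random sample per chain, so that majority examples discarded in one chain may still be used in another.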
