Radial-Based Oversampling for Multiclass Imbalanced Data Classification

Learning from imbalanced data is among the most popular topics in contemporary machine learning. However, the vast majority of attention in this field is given to binary problems, while their much more difficult multiclass counterparts remain relatively unexplored. Handling data sets with multiple skewed classes poses various challenges and calls for a better understanding of the relationships among classes. In this paper, we propose multiclass radial-based oversampling (MC-RBO), a novel data-sampling algorithm dedicated to multiclass problems. The main novelty of our method lies in using potential functions to generate artificial instances. We take into account information coming from all of the classes, in contrast to existing multiclass oversampling approaches that use only minority class characteristics. The generation of artificial instances is guided by exploring areas where the value of the mutual class distribution is very small. In this way, we ensure a smart oversampling procedure that can cope with difficult data distributions and alleviate the shortcomings of existing methods. The usefulness of the MC-RBO algorithm is evaluated in an extensive experimental study and backed up by a thorough statistical analysis. The obtained results show that by taking into account information coming from all of the classes and conducting smart oversampling, we can significantly improve the process of learning from multiclass imbalanced data.
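
The idea of guiding instance generation with a potential function can be illustrated with a small sketch. It is not the MC-RBO algorithm itself: the abstract does not specify the potential or the search procedure, so the Gaussian radial basis function, the spread parameter `gamma`, the random-walk search, and all function names below are assumptions made purely for illustration. The sketch pools the observations of all other classes, defines a mutual potential as their summed contribution minus that of the oversampled class, and moves each new point from a randomly chosen minority observation toward regions where the absolute value of this potential is smaller.

```python
import math
import random


def mutual_potential(x, others, minority, gamma=1.0):
    """Assumed potential: Gaussian bumps on other-class points minus bumps on minority points."""
    def rbf_sum(points):
        return sum(math.exp(-(math.dist(x, p) / gamma) ** 2) for p in points)
    return rbf_sum(others) - rbf_sum(minority)


def oversample(minority, others, n_new, gamma=1.0, step=0.1, n_steps=50, seed=0):
    """Hypothetical helper: random-walk search for points with small absolute potential."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Start from a randomly chosen observation of the class being oversampled.
        point = tuple(rng.choice(minority))
        best = abs(mutual_potential(point, others, minority, gamma))
        for _ in range(n_steps):
            # Propose a small Gaussian perturbation; keep it only if it lowers |potential|.
            candidate = tuple(v + rng.gauss(0.0, step) for v in point)
            score = abs(mutual_potential(candidate, others, minority, gamma))
            if score < best:
                point, best = candidate, score
        synthetic.append(point)
    return synthetic


# Toy data: one minority class and the pooled observations of all other classes.
minority = [(0.0, 0.0), (0.2, 0.1)]
others = [(1.0, 1.0), (1.2, 0.9), (-1.0, 1.0)]
new_points = oversample(minority, others, n_new=3)
```

In a multiclass setting, a routine of this kind would presumably be applied to each under-represented class in turn, with `others` rebuilt from the remaining classes each time; that scheduling is likewise an assumption rather than something stated in the abstract.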
