Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets

Canonical machine learning algorithms assume that the number of objects in the considered classes are roughly similar. However, in many real-life situations the distribution of examples is skewed since the examples of some of the classes appear much more frequently. This poses a difficulty to learning algorithms, as they will be biased towards the majority classes. In recent years many solutions have been proposed to tackle imbalanced classification, yet they mainly concentrate on binary scenarios. Multi-class imbalanced problems are far more difficult as the relationships between the classes are no longer straightforward. Additionally, one should analyze not only the imbalance ratio but also the characteristics of the objects within each class. In this paper we present a study on oversampling for multi-class imbalanced datasets that focuses on the analysis of the class characteristics. We detect subsets of specific examples in each class and fix the oversampling for each of them independently. Thus, we are able to use information about the class structure and boost the more difficult and important objects. We carry an extensive experimental analysis, which is backed-up with statistical analysis, in order to check when the preprocessing of some types of examples within a class may improve the indiscriminate preprocessing of all the examples in all the classes. The results obtained show that oversampling concrete types of examples may lead to a significant improvement over standard multi-class preprocessing that do not consider the importance of example types. Graphical abstractDisplay Omitted HighlightsA thorough analysis of oversampling for handling multi-class imbalanced datasets.Proposition to detect underlying structures and example types in considered classes.Smart oversampling based on extracted knowledge about imbalance distribution types.In-depth insight into the importance of selecting proper examples for oversampling.Guidelines that allow to design efficient classifiers for multi-class imbalanced data.

[1]  Bartosz Krawczyk,et al.  Clustering-based ensembles for one-class classification , 2014, Inf. Sci..

[2]  Kaizhu Huang,et al.  Imbalanced learning with a biased minimax probability machine , 2006, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[3]  Jorma Laurikkala,et al.  Instance-based data reduction for improved identification of difficult small classes , 2002, Intell. Data Anal..

[4]  David G. Stork,et al.  Pattern Classification , 1973 .

[5]  Theofanis Sapatinas,et al.  Discriminant Analysis and Statistical Pattern Recognition , 2005 .

[6]  Robert C. Holte,et al.  Concept Learning and the Problem of Small Disjuncts , 1989, IJCAI.

[7]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[8]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[9]  Nitesh V. Chawla,et al.  SMOTEBoost: Improving Prediction of the Minority Class in Boosting , 2003, PKDD.

[10]  Yaohua Tang,et al.  Improved Classification for Problem Involving Overlapping Patterns , 2007, IEICE Trans. Inf. Syst..

[11]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[12]  Gustavo E. A. P. A. Batista,et al.  Class imbalance revisited: a new experimental setup to assess the performance of treatment methods , 2014, Knowledge and Information Systems.

[13]  Yang Wang,et al.  Cost-sensitive boosting for classification of imbalanced data , 2007, Pattern Recognit..

[14]  José Salvador Sánchez,et al.  On the k-NN performance in a challenging scenario of imbalance and overlapping , 2008, Pattern Analysis and Applications.

[15]  Andrew K. C. Wong,et al.  Classification of Imbalanced Data: a Review , 2009, Int. J. Pattern Recognit. Artif. Intell..

[16]  Jerzy Stefanowski,et al.  Identification of Different Types of Minority Class Examples in Imbalanced Data , 2012, HAIS.

[17]  Francisco Herrera,et al.  On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed , 2014, Inf. Sci..

[18]  Nitesh V. Chawla,et al.  Building Decision Trees for the Multi-class Imbalance Problem , 2012, PAKDD.

[19]  Yuan-Hai Shao,et al.  An efficient weighted Lagrangian twin support vector machine for imbalanced data classification , 2014, Pattern Recognit..

[20]  Xingquan Zhu,et al.  Class Noise vs. Attribute Noise: A Quantitative Study , 2003, Artificial Intelligence Review.

[21]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[22]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[23]  Nicola Torelli,et al.  Training and assessing classification rules with imbalanced data , 2012, Data Mining and Knowledge Discovery.

[24]  WuXindong,et al.  Class noise vs. attribute noise , 2004 .

[25]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[26]  Nathalie Japkowicz,et al.  A Novelty Detection Approach to Classification , 1995, IJCAI.

[27]  Edward Roy Davies Training sets and a priori probabilities with the nearest neighbour method of pattern recognition , 1988, Pattern Recognit. Lett..

[28]  Szymon Wilk,et al.  Learning from Imbalanced Data in Presence of Noisy and Borderline Examples , 2010, RSCTC.

[29]  Carla E. Brodley,et al.  Identifying Mislabeled Training Data , 1999, J. Artif. Intell. Res..

[30]  Stan Matwin,et al.  Learning When Negative Examples Abound , 1997, ECML.

[31]  Szymon Wilk,et al.  Integrating Selective Pre-processing of Imbalanced Data with Ivotes Ensemble , 2010, RSCTC.

[32]  Xin Yao,et al.  Dynamic Sampling Approach to Training Neural Networks for Multiclass Imbalance Classification , 2013, IEEE Transactions on Neural Networks and Learning Systems.

[33]  Nuanwan Soonthornphisaj,et al.  Cluster-based sampling of multiclass imbalanced data , 2014, Intell. Data Anal..

[34]  George A. Tsihrintzis,et al.  The Class Imbalance Problem , 2017 .

[35]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[36]  Gustavo E. A. P. A. Batista,et al.  Class Imbalances versus Class Overlapping: An Analysis of a Learning System Behavior , 2004, MICAI.

[37]  Roberto Alejo,et al.  A hybrid method to face class overlap and class imbalance on neural networks and multi-class scenarios , 2013, Pattern Recognit. Lett..

[38]  George H. John Robust Decision Trees: Removing Outliers from Databases , 1995, KDD.

[39]  Mikel Galar,et al.  Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches , 2013, Knowl. Based Syst..

[40]  Zhi-Hua Zhou,et al.  Exploratory Undersampling for Class-Imbalance Learning , 2009, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[41]  Rushi Longadge,et al.  Class Imbalance Problem in Data Mining Review , 2013, ArXiv.

[42]  Xin Yao,et al.  Diversity analysis on imbalanced data sets by using ensemble models , 2009, 2009 IEEE Symposium on Computational Intelligence and Data Mining.

[43]  José Salvador Sánchez,et al.  An Empirical Study of the Behavior of Classifiers on Imbalanced and Overlapped Data Sets , 2007, CIARP.

[44]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[45]  Xue-wen Chen,et al.  FAST: a roc-based feature selection metric for small samples and imbalanced data classification problems , 2008, KDD.

[46]  Pedro Antonio Gutiérrez,et al.  A dynamic over-sampling procedure based on sensitivity for multi-class problems , 2011, Pattern Recognit..

[47]  Gerald Schaefer,et al.  Cost-sensitive decision tree ensembles for effective imbalanced classification , 2014, Appl. Soft Comput..

[48]  Zhi-Hua Zhou,et al.  Ieee Transactions on Knowledge and Data Engineering 1 Training Cost-sensitive Neural Networks with Methods Addressing the Class Imbalance Problem , 2022 .

[49]  Francisco Herrera,et al.  Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics , 2012, Expert Syst. Appl..

[50]  Francisco Herrera,et al.  Analyzing the presence of noise in multi-class problems: alleviating its influence with the One-vs-One decomposition , 2012, Knowledge and Information Systems.

[51]  Francisco Herrera,et al.  SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering , 2015, Inf. Sci..

[52]  Tomasz Maciejewski,et al.  Local neighbourhood extension of SMOTE for mining imbalanced data , 2011, 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM).

[53]  Francisco Herrera,et al.  An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics , 2013, Inf. Sci..

[54]  Haibo He,et al.  ADASYN: Adaptive synthetic sampling approach for imbalanced learning , 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[55]  Haibo He,et al.  RAMOBoost: Ranked Minority Oversampling in Boosting , 2010, IEEE Transactions on Neural Networks.

[56]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[57]  Nada Lavrac,et al.  Experiments with Noise Filtering in a Medical Domain , 1999, ICML.

[58]  José Hernández-Orallo,et al.  An experimental comparison of performance measures for classification , 2009, Pattern Recognit. Lett..

[59]  Xin Yao,et al.  Multiclass Imbalance Problems: Analysis and Potential Solutions , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[60]  Tony R. Martinez,et al.  Improved Heterogeneous Distance Functions , 1996, J. Artif. Intell. Res..

[61]  Chumphol Bunkhumpornpat,et al.  Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem , 2009, PAKDD.

[62]  Zhen Liu,et al.  A class-oriented feature selection approach for multi-class imbalanced network traffic datasets based on local and global metrics fusion , 2015, Neurocomputing.