Synthetic minority oversampling technique for multiclass imbalance problems

Abstract Multiclass imbalance data learning has attracted increasing interests from the research community. Unfortunately, existing oversampling solutions, when facing this more challenging problem as compared to two-class imbalance case, have shown their respective deficiencies such as causing serious over generalization or not actively improving the class imbalance in data space. We propose a k -nearest neighbors ( k -NN)-based synthetic minority oversampling algorithm, termed SMOM, to handle multiclass imbalance problems. Different from previous k -NN-based oversampling algorithms, where for any original minority instance the synthetic instances are randomly generated in the directions of its k -nearest neighbors, SMOM assigns a selection weight to each neighbor direction. The neighbor directions that can produce serious over generalization will be given small selection weights. This way, SMOM forms a mechanism of avoiding over generalization as the safer neighbor directions are more likely to be selected to yield the synthetic instances. Owing to this, SMOM can aggressively explore the regions of minority classes by configuring a high value for parameter k , but do not result in severe over generalization. Extensive experiments using 27 real-world data sets demonstrate the effectiveness of our algorithm.

[1]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[2]  Chumphol Bunkhumpornpat,et al.  Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem , 2009, PAKDD.

[3]  David J. Hand,et al.  A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems , 2001, Machine Learning.

[4]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[5]  Wei Chu,et al.  Gaussian Processes for Ordinal Regression , 2005, J. Mach. Learn. Res..

[6]  Xin Yao,et al.  MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning , 2014 .

[7]  Kay Chen Tan,et al.  Evolutionary Cluster-Based Synthetic Oversampling Ensemble (ECO-Ensemble) for Imbalance Learning , 2017, IEEE Transactions on Cybernetics.

[8]  Sattar Hashemi,et al.  To Combat Multi-Class Imbalanced Problems by Means of Over-Sampling Techniques , 2016, IEEE Transactions on Knowledge and Data Engineering.

[9]  Kishan G. Mehrotra,et al.  Efficient classification for multiclass problems using modular neural networks , 1995, IEEE Trans. Neural Networks.

[10]  Hans-Peter Kriegel,et al.  Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications , 1998, Data Mining and Knowledge Discovery.

[11]  Hui Xiong,et al.  COG: local decomposition for rare class analysis , 2010, Data Mining and Knowledge Discovery.

[12]  Taeho Jo,et al.  Class imbalances versus small disjuncts , 2004, SKDD.

[13]  Gregory W. Corder,et al.  Nonparametric Statistics : A Step-by-Step Approach , 2014 .

[14]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[15]  Haibo He,et al.  RAMOBoost: Ranked Minority Oversampling in Boosting , 2010, IEEE Transactions on Neural Networks.

[16]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[17]  Bartosz Krawczyk,et al.  Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets , 2016, Pattern Recognit..

[18]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[19]  Yang Wang,et al.  Boosting for Learning Multiple Classes with Imbalanced Class Distribution , 2006, Sixth International Conference on Data Mining (ICDM'06).

[20]  Iman Nekooeimehr,et al.  Cluster-based Weighted Oversampling for Ordinal Regression (CWOS-Ord) , 2016, Neurocomputing.

[21]  Haibo He,et al.  ADASYN: Adaptive synthetic sampling approach for imbalanced learning , 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[22]  Xin Yao,et al.  Dynamic Sampling Approach to Training Neural Networks for Multiclass Imbalance Classification , 2013, IEEE Transactions on Neural Networks and Learning Systems.

[23]  Xin Yao,et al.  Multiclass Imbalance Problems: Analysis and Potential Solutions , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[24]  Gary M. Weiss Mining with rarity: a unifying framework , 2004, SKDD.

[25]  Tomasz Maciejewski,et al.  Local neighbourhood extension of SMOTE for mining imbalanced data , 2011, 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM).

[26]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[27]  Jerzy Stefanowski,et al.  Identification of Different Types of Minority Class Examples in Imbalanced Data , 2012, HAIS.

[28]  See-Kiong Ng,et al.  Integrated Oversampling for Imbalanced Time Series Classification , 2013, IEEE Transactions on Knowledge and Data Engineering.