On the Performance of Oversampling Techniques for Class Imbalance Problems

Although over 90 oversampling approaches have been developed in the imbalance learning domain, most empirical studies and applications still rely on the “classical” resampling techniques. In this paper, we set up experiments on 19 benchmark datasets to study the efficiency of six powerful oversampling approaches, including both “classical” and newer ones. According to our experimental results, the newer oversampling techniques, which consider the minority class distribution, perform better in most cases, and RACOG gives the best performance among the six reviewed approaches. We further validate this conclusion on our real-world inspired vehicle datasets and find that applying oversampling techniques can improve performance by around 10%. In addition, seven data complexity measures are considered, with the initial aim of investigating the relationship between data complexity and the choice of resampling technique. Although no obvious relationship could be extracted from our experiments, we find that the F1v value, an overlap measure that most researchers ignore, has a strong negative correlation with the potential AUC value (after resampling).
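
To make concrete what an oversampling step does, the following is a minimal sketch of SMOTE-style interpolation, taken here as a representative "classical" technique. It is an assumption for illustration only: the abstract does not list the six approaches studied, and the function name, the toy data, and the parameters `k` and `seed` are hypothetical. The sketch requires at least two minority samples.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbours.

    Illustrative sketch only, not the implementation used in the paper.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy data (hypothetical): 4 minority samples versus 10 majority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(3.0, 3.0)] * 10

# Generate enough synthetic points to match the majority class size.
new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

Because each synthetic point lies on a segment between two existing minority samples, this kind of technique never leaves the convex hull of the minority class; distribution-aware approaches such as RACOG instead sample from an estimate of the minority class distribution.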
