An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets

Abstract: Learning and mining from imbalanced datasets have gained increasing interest in recent years. One simple but efficient way to improve the performance of standard machine learning techniques on imbalanced datasets is the synthetic generation of minority samples. This paper presents and discusses a detailed empirical comparison of 85 variants of minority oversampling techniques, evaluated on 104 imbalanced datasets. The goals of the work are to set a new baseline in the field, to determine which oversampling principles lead to the best results under general circumstances, and to give practitioners guidance on which techniques to use with particular types of datasets.
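To make "synthetic generation of minority samples" concrete, the sketch below shows the basic interpolation idea behind SMOTE (Chawla et al., 2002): a new point is drawn on the line segment between a minority sample and one of its nearest minority neighbours. It is only an illustration and not any of the 85 variants as evaluated in the paper. The function name `smote_like_oversample`, its parameters, and the toy data are assumptions introduced here.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours.

    Hypothetical helper for illustration; `minority` is a list of feature vectors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Pick one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # Draw a random point on the segment between `base` and the neighbour.
        u = rng.random()
        synthetic.append([b + u * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies within the bounding box of the minority class. Many of the variants compared in such studies differ mainly in how the base sample and the neighbour are selected, or in how the new point is placed.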
