On the effectiveness of preprocessing methods when dealing with different levels of class imbalance

The present paper investigates the influence of both the imbalance ratio and the classifier on the performance of several resampling strategies to deal with imbalanced data sets. The study focuses on evaluating how learning is affected when different resampling algorithms transform the originally imbalanced data into artificially balanced class distributions. Experiments over 17 real data sets using eight different classifiers, four resampling algorithms and four performance evaluation measures show that over-sampling the minority class consistently outperforms under-sampling the majority class when data sets are strongly imbalanced, whereas there are not significant differences for databases with a low imbalance. Results also indicate that the classifier has a very poor influence on the effectiveness of the resampling strategies.

[1]  Vasile Palade,et al.  Optimized Precision - A New Measure for Classifier Performance Evaluation , 2006, 2006 IEEE International Conference on Evolutionary Computation.

[2]  Xue-wen Chen,et al.  Pruning support vectors for imbalanced data classification , 2005, Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005..

[3]  Dennis L. Wilson,et al.  Asymptotic Properties of Nearest Neighbor Rules Using Edited Data , 1972, IEEE Trans. Syst. Man Cybern..

[4]  Gary M. Weiss The Impact of Small Disjuncts on Classifier Learning , 2010, Data Mining.

[5]  Charles X. Ling,et al.  Using AUC and accuracy in evaluating learning algorithms , 2005, IEEE Transactions on Knowledge and Data Engineering.

[6]  Taghi M. Khoshgoftaar,et al.  Experimental perspectives on learning from imbalanced data , 2007, ICML '07.

[7]  Stan Matwin,et al.  Machine Learning for the Detection of Oil Spills in Satellite Radar Images , 1998, Machine Learning.

[8]  Francisco Herrera,et al.  Evolutionary-based selection of generalized instances for imbalanced classification , 2012, Knowl. Based Syst..

[9]  Jianpei Zhang,et al.  A novel virtual sample generation method based on Gaussian distribution , 2011, Knowl. Based Syst..

[10]  Ying He,et al.  MSMOTE: Improving Classification Performance When Training Data is Imbalanced , 2009, 2009 Second International Workshop on Computer Science and Engineering.

[11]  José Salvador Sánchez,et al.  Decision boundary preserving prototype selection for nearest neighbor classification , 2005, Int. J. Pattern Recognit. Artif. Intell..

[12]  Gustavo E. A. P. A. Batista,et al.  Learning with Class Skews and Small Disjuncts , 2004, SBIA.

[13]  José Salvador Sánchez Garreta,et al.  On the Use of Surrounding Neighbors for Synthetic Over-Sampling of the Minority Class , 2008 .

[14]  Rich Caruana,et al.  Data mining in metric space: an empirical analysis of supervised learning performance criteria , 2004, ROCAI.

[15]  Taghi M. Khoshgoftaar,et al.  Evolutionary data analysis for the class imbalance problem , 2010, Intell. Data Anal..

[16]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[17]  Francisco Herrera,et al.  Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling , 2011, Soft Comput..

[18]  Tom Fawcett,et al.  Adaptive Fraud Detection , 1997, Data Mining and Knowledge Discovery.

[19]  Nathalie Japkowicz,et al.  A Visualization-Based Exploratory Technique for Classifier Comparison with Respect to Multiple Metrics and Multiple Domains , 2008, ECML/PKDD.

[20]  Herna L. Viktor,et al.  Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach , 2004, SKDD.

[21]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[22]  Zhi-Hua Zhou,et al.  Exploratory Undersampling for Class-Imbalance Learning , 2009, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[23]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[24]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[25]  José Salvador Sánchez,et al.  On the k-NN performance in a challenging scenario of imbalance and overlapping , 2008, Pattern Analysis and Applications.

[26]  Nikolaos M. Avouris,et al.  EVALUATION OF CLASSIFIERS FOR AN UNEVEN CLASS DISTRIBUTION PROBLEM , 2006, Appl. Artif. Intell..

[27]  Jorma Laurikkala,et al.  Improving Identification of Difficult Small Classes by Balancing Class Distribution , 2001, AIME.

[28]  Salvatore J. Stolfo,et al.  Distributed data mining in credit card fraud detection , 1999, IEEE Intell. Syst..

[29]  Tom Fawcett,et al.  Robust Classification for Imprecise Environments , 2000, Machine Learning.

[30]  Alexander Dekhtyar,et al.  Information Retrieval , 2018, Lecture Notes in Computer Science.

[31]  Stan Szpakowicz,et al.  Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for Performance Evaluation , 2006, Australian Conference on Artificial Intelligence.

[32]  Qinghua Hu,et al.  A comparative study on rough set based class imbalance learning , 2008, Knowl. Based Syst..

[33]  Antoine Geissbühler,et al.  Learning from imbalanced data in surveillance of nosocomial infection , 2006, Artif. Intell. Medicine.

[34]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[35]  Rohini K. Srihari,et al.  Feature selection for text categorization on imbalanced data , 2004, SKDD.

[36]  Taghi M. Khoshgoftaar,et al.  Knowledge discovery from imbalanced and noisy data , 2009, Data Knowl. Eng..

[37]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[38]  Taeho Jo,et al.  Class imbalances versus small disjuncts , 2004, SKDD.

[39]  R. Barandelaa,et al.  Strategies for learning in class imbalance problems , 2003, Pattern Recognit..

[40]  Peter E. Hart,et al.  The condensed nearest neighbor rule (Corresp.) , 1968, IEEE Trans. Inf. Theory.

[41]  Hewijin Christine Jiau,et al.  Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem , 2006 .

[42]  Foster J. Provost,et al.  Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction , 2003, J. Artif. Intell. Res..

[43]  Yue-Shi Lee,et al.  Investigating the Effect of Sampling Methods for Imbalanced Data Distributions , 2006, 2006 IEEE International Conference on Systems, Man and Cybernetics.

[44]  Nitesh V. Chawla,et al.  SMOTEBoost: Improving Prediction of the Minority Class in Boosting , 2003, PKDD.

[45]  Songbo Tan,et al.  Neighbor-weighted K-nearest neighbor for unbalanced text corpus , 2005, Expert Syst. Appl..

[46]  Paris A. Mastorocostas,et al.  An application of supervised and unsupervised learning approaches to telecommunications fraud detection , 2008, Knowl. Based Syst..

[47]  Rosa Maria Valdovinos,et al.  The Imbalanced Training Sample Problem: Under or over Sampling? , 2004, SSPR/SPR.

[48]  José Salvador Sánchez,et al.  Theoretical Analysis of a Performance Measure for Imbalanced Data , 2010, 2010 20th International Conference on Pattern Recognition.

[49]  Xiangji Huang,et al.  Boosting Prediction Accuracy on Imbalanced Datasets with SVM Ensembles , 2006, PAKDD.

[50]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[51]  Javier Pérez-Rodríguez,et al.  Class imbalance methods for translation initiation site recognition in DNA sequences , 2012, Knowl. Based Syst..

[52]  I. Tomek,et al.  Two Modifications of CNN , 1976 .

[53]  Taeho Jo,et al.  A Multiple Resampling Method for Learning from Imbalanced Data Sets , 2004, Comput. Intell..

[54]  Nicolás García-Pedrajas,et al.  Class Imbalance Methods for Translation Initiation Site Recognition , 2010, IEA/AIE.