RWO-Sampling: A random walk over-sampling approach to imbalanced data classification

Abstract This study investigates how to alleviate class imbalance when constructing unbiased classifiers in settings where instances of one class heavily outnumber those of another. Because classification performance depends strongly on two factors — keeping the data distribution unchanged and expanding the class boundary after synthetic samples are added — we take both into account and propose a Random Walk Over-Sampling approach (RWO-Sampling) that balances the class sizes by creating synthetic samples through random walks from the real data. Under certain conditions, it can be proved that both the expected mean and the expected standard deviation of the generated samples equal those of the original minority-class data. RWO-Sampling also expands the minority-class boundary once synthetic samples have been generated. We perform a broad experimental evaluation, and the results show that RWO-Sampling performs statistically significantly better than alternative methods on imbalanced data sets when used with common baseline classifiers.
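As a rough illustration of the idea the abstract describes, the following is a minimal NumPy sketch, not the authors' implementation: each synthetic point is obtained by taking a real minority instance and perturbing every attribute with Gaussian noise scaled by that attribute's standard deviation over the minority class, divided by the square root of the minority sample size. The function name `rwo_sample` and the exact step scaling are assumptions made here for illustration.

```python
import numpy as np

def rwo_sample(X_min, n_new, rng=None):
    """Hypothetical sketch of random-walk over-sampling.

    Each synthetic sample is a randomly chosen minority instance
    whose attributes take one random step: x' = x - (sigma / sqrt(n)) * r,
    with r drawn from a standard normal and sigma the per-attribute
    standard deviation of the minority class.
    """
    rng = np.random.default_rng(rng)
    n, d = X_min.shape
    sigma = X_min.std(axis=0)               # per-attribute std of minority class
    idx = rng.integers(0, n, size=n_new)    # pick real minority instances at random
    steps = rng.standard_normal((n_new, d)) # one standard-normal step per attribute
    return X_min[idx] - (sigma / np.sqrt(n)) * steps

# Usage: over-sample a toy 2-D minority class to 200 synthetic points.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
S = rwo_sample(X, 200, rng=0)
```

Because the noise has zero mean and its scale shrinks with the minority sample size, the synthetic samples stay centered on the observed minority data while still landing slightly outside the original points, which is consistent with the boundary-expansion property claimed in the abstract.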
