A Novel Differential Evolution-Clustering Hybrid Resampling Algorithm on Imbalanced Datasets

When a support vector machine (SVM) is trained on an imbalanced dataset (IDS), its separating hyperplane shifts toward the minority (positive) class, which lowers classification accuracy. To address this problem, we propose a novel differential evolution-clustering hybrid resampling SVM algorithm (DEC-SVM). The algorithm first over-samples the positive class using mutation and crossover operators similar to those of Differential Evolution (DE), increasing the proportion of positive samples. It then applies clustering to the over-sampled training set as a data-cleaning step for both classes, removing redundant or noisy samples. Experimental results on ten UCI benchmark datasets show that DEC-SVM outperforms standard SVM, SMOTE-SVM, and DE-SVM in terms of F-measure and area under the ROC curve (AUC).
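
The over-sampling step can be illustrated with a minimal Python sketch. It assumes the classic DE/rand/1 mutation and binomial crossover applied to minority samples only; the abstract does not give the exact operators or parameters, so the function name `de_oversample`, the scale factor `f`, and the crossover rate `cr` are hypothetical choices for illustration and not the authors' implementation. The clustering-based cleaning step is not sketched, since the abstract does not specify its details.

```python
import random


def de_oversample(minority, n_new, f=0.5, cr=0.9, seed=0):
    """Generate n_new synthetic minority samples with DE-style operators.

    minority: list of equal-length feature vectors (lists of floats).
    f, cr:    assumed scale factor and crossover rate (not from the paper).
    """
    if len(minority) < 4:
        raise ValueError("need at least 4 minority samples")
    rng = random.Random(seed)
    dim = len(minority[0])
    synthetic = []
    for _ in range(n_new):
        # Draw a target and three distinct donors from the minority class.
        target, a, b, c = rng.sample(minority, 4)
        # Mutation: perturb donor a by the scaled difference of b and c.
        mutant = [a[j] + f * (b[j] - c[j]) for j in range(dim)]
        # Binomial crossover: j_rand forces at least one mutant coordinate.
        j_rand = rng.randrange(dim)
        child = [
            mutant[j] if (rng.random() < cr or j == j_rand) else target[j]
            for j in range(dim)
        ]
        synthetic.append(child)
    return synthetic


# Toy usage: five positive samples in two dimensions, ten synthetic ones added.
minority = [[1.0, 2.0], [1.2, 1.9], [0.8, 2.2], [1.1, 2.1], [0.9, 1.8]]
new_points = de_oversample(minority, n_new=10)
print(len(minority) + len(new_points), "positive samples after over-sampling")
```

Because each synthetic point is built from differences between existing positive samples, the new points stay near the region the minority class already occupies; a cleaning pass such as the clustering step described above would then be applied to the enlarged training set.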
