A Distance-Based Weighted Undersampling Scheme for Support Vector Machines and its Application to Imbalanced Classification

Support vector machines (SVMs) play a prominent role in classic machine learning, especially in classification and regression. Because they are built on structural risk minimization, SVMs are well regarded for reducing overfitting, mitigating the curse of dimensionality, and avoiding local minima. Nevertheless, existing SVMs do not perform well on class-imbalanced or large-scale data. Undersampling is a plausible remedy for imbalanced problems, but it suffers from high computational complexity and reduced accuracy because of its many iterations and its random sampling process. To improve the classification performance of SVMs on imbalanced data, this work proposes a weighted undersampling (WU) scheme for SVM based on space geometry distance, yielding an improved algorithm named WU-SVM. In WU-SVM, majority samples are grouped into subregions (SRs) and assigned different weights according to their Euclidean distance to the hyperplane. Samples in an SR with a higher weight are more likely to be sampled and used in each learning iteration, so that the distribution information of the original data set is retained as much as possible. Comprehensive experiments test WU-SVM on 21 binary-class and six multiclass publicly available data sets. The results show that it outperforms the state-of-the-art methods in terms of three popular metrics for imbalanced classification: area under the curve, F-Measure, and G-Mean.
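
The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how subregions are formed or how distance maps to weight, so the sketch assumes equal-size distance bands and a weight equal to the inverse of each band's mean distance (closer bands are sampled more often). The function names `distance_to_hyperplane`, `group_into_subregions`, and `weighted_undersample`, and the parameters `n_regions` and `n_keep`, are hypothetical, and the hyperplane `(w, b)` is taken as given, for example from a previous SVM fit.

```python
import math
import random


def distance_to_hyperplane(x, w, b):
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def group_into_subregions(samples, w, b, n_regions):
    """Split samples into up to n_regions distance bands (band 0 is nearest).

    Assumption: equal-size bands over the distance-sorted samples.
    """
    if not samples:
        return []
    ordered = sorted(samples, key=lambda x: distance_to_hyperplane(x, w, b))
    size = math.ceil(len(ordered) / n_regions)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def weighted_undersample(samples, w, b, n_regions, n_keep, seed=0):
    """Draw up to n_keep majority samples without replacement.

    A subregion is picked with probability proportional to its weight,
    then one sample is drawn uniformly from that subregion.
    """
    rng = random.Random(seed)
    pools = [list(r) for r in group_into_subregions(samples, w, b, n_regions)]
    # Assumed weighting: inverse of the mean distance of each subregion.
    weights = []
    for pool in pools:
        mean_d = sum(distance_to_hyperplane(x, w, b) for x in pool) / len(pool)
        weights.append(1.0 / (mean_d + 1e-12))
    chosen = []
    while len(chosen) < n_keep and any(pools):
        live = [i for i, pool in enumerate(pools) if pool]
        i = rng.choices(live, weights=[weights[j] for j in live], k=1)[0]
        chosen.append(pools[i].pop(rng.randrange(len(pools[i]))))
    return chosen


# Usage: 200 synthetic majority points, hyperplane x1 = 0, keep 40 of them.
data_rng = random.Random(1)
majority = [(data_rng.uniform(-5, 5), data_rng.uniform(-5, 5)) for _ in range(200)]
picked = weighted_undersample(majority, w=(1.0, 0.0), b=0.0, n_regions=4, n_keep=40)
```

In an iterative scheme, the reduced majority set would be combined with the minority samples to retrain the classifier, and the distances would be recomputed against the updated hyperplane before the next draw.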
