An overlap-sensitive margin classifier for imbalanced and overlapping data

Abstract Classification is an important task in various areas. In many real-world applications, class imbalance and overlapping problems have been reported as major issues in the application of traditional classification algorithms. An imbalance problem occurs when training data contain considerably more representatives of one class than of other classes. Class overlap occurs when a region in the data space contains a similar number of data for each class. When a class overlap occurs in imbalanced data sets, classification becomes even more complicated. Although various approaches have been proposed to deal separately with class imbalance and overlapping problems, only a few studies have attempted to address both problems simultaneously. In this paper, we propose an overlap-sensitive margin (OSM) classifier based on a modified fuzzy support vector machine and k-nearest neighbor algorithm to address imbalanced and overlapping data sets. The main idea of the proposed OSM classifier is to separate the data space into soft- and hard-overlap regions using the modified fuzzy support vector machine algorithm. The separated spaces are then classified using the decision boundaries of the support vector machine and 1-nearest neighbor algorithms. Furthermore, by separating a data set into soft- and hard-overlap regions, one can determine which part of the data is to be examined more closely for classification in real-world situations. Experiments using synthetic and real-world data sets demonstrated that the proposed OSM classifier outperformed existing methods for imbalanced and overlapping situations.

[1]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[2]  Yinghuan Shi,et al.  A Fast Distributed Classification Algorithm for Large-Scale Imbalanced Data , 2016, 2016 IEEE 16th International Conference on Data Mining (ICDM).

[3]  Iman Nekooeimehr,et al.  Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets , 2016, Expert Syst. Appl..

[4]  Gongping Yang,et al.  On the Class Imbalance Problem , 2008, 2008 Fourth International Conference on Natural Computation.

[5]  Petros Xanthopoulos,et al.  A priori synthetic over-sampling methods for increasing classification sensitivity in imbalanced data sets , 2016, Expert Syst. Appl..

[6]  Sheng-De Wang,et al.  Fuzzy support vector machines , 2002, IEEE Trans. Neural Networks.

[7]  Joarder Kamruzzaman,et al.  z-SVM: An SVM for Improved Classification of Imbalanced Data , 2006, Australian Conference on Artificial Intelligence.

[8]  Nathalie Japkowicz,et al.  Boosting support vector machines for imbalanced data sets , 2008, Knowledge and Information Systems.

[9]  Xiaowei Yang,et al.  A bilateral-truncated-loss based robust support vector machine for classification problems , 2015, Soft Comput..

[10]  Jian Chu,et al.  A novel SVM modeling approach for highly imbalanced and overlapping classification , 2011, Intell. Data Anal..

[11]  Nitesh V. Chawla,et al.  Editorial: special issue on learning from imbalanced data sets , 2004, SKDD.

[12]  Gee Wah Ng,et al.  Classification for overlapping classes using optimized overlapping region detection and soft decision , 2010, 2010 13th International Conference on Information Fusion.

[13]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[14]  Peter D. Turney Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm , 1994, J. Artif. Intell. Res..

[15]  Yang Liu,et al.  Combining integrated sampling with SVM ensembles for learning from imbalanced datasets , 2011, Inf. Process. Manag..

[16]  Dirk Van den Poel,et al.  Handling class imbalance in customer churn prediction , 2009, Expert Syst. Appl..

[17]  Sheng Chen,et al.  Statistics local fisher discriminant analysis for industrial process fault classification , 2016, 2016 UKACC 11th International Conference on Control (CONTROL).

[18]  Hong-Liang Dai,et al.  Class imbalance learning via a fuzzy total margin based support vector machine , 2015, Appl. Soft Comput..

[19]  Ekrem Duman,et al.  A cost-sensitive decision tree approach for fraud detection , 2013, Expert Syst. Appl..

[20]  Stanislaw Osowski,et al.  Gene selection for cancer classification , 2009 .

[21]  José R. Dorronsoro,et al.  Neural fraud detection in credit card operations , 1997, IEEE Trans. Neural Networks.

[22]  Özden Gür Ali,et al.  Dynamic churn prediction framework with more effective use of rare event data: The case of private banking , 2014, Expert Syst. Appl..

[23]  Swagatam Das,et al.  Near-Bayesian Support Vector Machines for imbalanced data classification with equal or unequal misclassification costs , 2015, Neural Networks.

[24]  Rong Pan,et al.  Mix-ratio sampling: Classifying multiclass imbalanced mouse brain images using support vector machine , 2010, Expert Syst. Appl..

[25]  Charles Elkan,et al.  The Foundations of Cost-Sensitive Learning , 2001, IJCAI.

[26]  Nello Cristianini,et al.  Controlling the Sensitivity of Support Vector Machines , 1999 .

[27]  Jacek M. Zurada,et al.  Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance , 2008, Neural Networks.

[28]  Junjie Wu,et al.  Classification with Class Overlapping: A Systematic Study , 2010 .

[29]  Yanqing Zhang,et al.  SVMs Modeling for Highly Imbalanced Classification , 2009, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[30]  I. Tomek,et al.  Two Modifications of CNN , 1976 .

[31]  Yu-Lin He,et al.  Fuzziness based semi-supervised learning approach for intrusion detection system , 2017, Inf. Sci..

[32]  Hongyuan Zha,et al.  Entropy-based fuzzy support vector machine for imbalanced datasets , 2017, Knowl. Based Syst..

[33]  Dazhe Zhao,et al.  An Optimized Cost-Sensitive SVM for Imbalanced Data Learning , 2013, PAKDD.

[34]  Mahesh Chandra Govil,et al.  A comparative analysis of SVM and its stacking with other classification algorithm for intrusion detection , 2016, 2016 International Conference on Advances in Computing, Communication, & Automation (ICACCA) (Spring).

[35]  Vasile Palade,et al.  FSVM-CIL: Fuzzy Support Vector Machines for Class Imbalance Learning , 2010, IEEE Transactions on Fuzzy Systems.

[36]  Sungzoon Cho,et al.  EUS SVMs: Ensemble of Under-Sampled SVMs for Data Imbalance Problems , 2006, ICONIP.

[37]  Federico Girosi,et al.  Training support vector machines: an application to face detection , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[38]  Sulabha S. Apte,et al.  Improved Study of Heart Disease Prediction System using Data Mining Classification Techniques , 2012 .

[39]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[40]  Igor Kononenko,et al.  Cost-Sensitive Learning with Neural Networks , 1998, ECAI.

[41]  Zhi-Hua Zhou,et al.  Ieee Transactions on Knowledge and Data Engineering 1 Training Cost-sensitive Neural Networks with Methods Addressing the Class Imbalance Problem , 2022 .

[42]  Daphne Koller,et al.  Support Vector Machine Active Learning with Applications to Text Classification , 2000, J. Mach. Learn. Res..

[43]  H. Koh,et al.  Data mining applications in healthcare. , 2005, Journal of healthcare information management : JHIM.

[44]  Changming Zhu,et al.  Entropy-based matrix learning machine for imbalanced data sets , 2017, Pattern Recognit. Lett..

[45]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[46]  Jerzy Stefanowski,et al.  Addressing imbalanced data with argument based rule learning , 2015, Expert Syst. Appl..

[47]  Dennis L. Wilson,et al.  Asymptotic Properties of Nearest Neighbor Rules Using Edited Data , 1972, IEEE Trans. Syst. Man Cybern..

[48]  Diane J. Cook,et al.  Handling Class Overlap and Imbalance to Detect Prompt Situations in Smart Homes , 2013, 2013 IEEE 13th International Conference on Data Mining Workshops.

[49]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[50]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[51]  Jesús Alcalá-Fdez,et al.  KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework , 2011, J. Multiple Valued Log. Soft Comput..

[52]  Yang Wang,et al.  Cost-sensitive boosting for classification of imbalanced data , 2007, Pattern Recognit..

[53]  José Salvador Sánchez,et al.  On the k-NN performance in a challenging scenario of imbalance and overlapping , 2008, Pattern Analysis and Applications.

[54]  Salvatore J. Stolfo,et al.  Distributed data mining in credit card fraud detection , 1999, IEEE Intell. Syst..