A novel SVM modeling approach for highly imbalanced and overlapping classification

Traditional classification algorithms can perform poorly on highly imbalanced and overlapping data sets. In this paper, we focus on modifying support vector machines (SVMs) to make them suitable for highly imbalanced and overlapping (HIO) classification. Based on an analysis of most SVM learning algorithms for imbalanced classification, we argue that the key problem in SVM-based algorithms, which stems from the linearity property of the SVM, is that increasing the number of correctly predicted minority samples causes even more majority samples to be misclassified. We then develop a novel algorithm, HIO-SVM, that can recognize all minority samples while minimizing the error rate on majority samples. The proposed approach identifies the non-overlapping samples in one feature space; furthermore, by iteratively shifting kernel spaces, it recognizes the non-overlapping samples in different kernel spaces. Because the distribution is highly imbalanced, the remaining overlapping samples can be regarded as minority samples. As a result, all minority samples are predicted correctly while the error rate on majority samples is guaranteed to be minimized. Finally, numerous case studies demonstrate the properties and effectiveness of the proposed HIO-SVM algorithm.
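The trade-off argued above can be illustrated with a minimal sketch. It does not implement HIO-SVM; it only sweeps a single threshold over hypothetical one-dimensional decision scores, standing in for the output of a linear decision function. The scores, the `sweep` helper, and the class sizes are all assumptions made for illustration and are not taken from the paper.

```python
# Hypothetical decision scores from a linear decision function (illustrative only).
# Majority (negative) samples overlap the minority (positive) samples.
majority = [0.1, 0.2, 0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.8]
minority = [0.58, 0.75, 0.9]


def sweep(majority, minority):
    """Lower the threshold one minority sample at a time and count errors.

    Samples with score >= threshold are predicted as minority.
    Returns a list of (threshold, minority_recalled, majority_misclassified).
    """
    rows = []
    for t in sorted(minority, reverse=True):
        recalled = sum(1 for s in minority if s >= t)        # minority predicted correctly
        misclassified = sum(1 for s in majority if s >= t)   # majority predicted as minority
        rows.append((t, recalled, misclassified))
    return rows


for t, recalled, misclassified in sweep(majority, minority):
    print(
        f"threshold={t:.2f}  "
        f"minority recalled={recalled}/{len(minority)}  "
        f"majority misclassified={misclassified}/{len(majority)}"
    )
```

On this toy data, each additional minority sample recovered costs more majority errors than the previous one, which is the behavior the abstract attributes to a single linear boundary in an overlapping region.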
