Instance categorization by support vector machines to adjust weights in AdaBoost for imbalanced data classification

We proposed an improved weighted SVM for imbalanced classification.Different factor scores are computed by categorizing instances based on the SVM margin.Numerical experiments showed the proposed method outperforming the existing approaches in terms of F-measure AUC. To address class imbalance in data, we propose a new weight adjustment factor that is applied to a weighted support vector machine (SVM) as a weak learner of the AdaBoost algorithm. Different factor scores are computed by categorizing instances based on the SVM margin and are assigned to related instances. The SVM margin is used to define borderline and noisy instances, and the factor scores are assigned to only borderline instances and positive noise. The adjustment factor is then employed as a multiplier to the instance weight in the AdaBoost algorithm when learning a weighted SVM. Using 10 real class-imbalanced datasets, we compare the proposed method to a standard SVM and other SVMs combined with various sampling and boosting methods. Numerical experiments show that the proposed method outperforms existing approaches in terms of F-measure and area under the receiver operating characteristic curve, which means that the proposed method is useful for relaxing the class-imbalance problem by addressing well-known degradation issues such as overlap, small disjunct, and data shift problems.

[1]  Katharina Morik,et al.  Combining Statistical Learning with a Knowledge-Based Approach - A Case Study in Intensive Care Monitoring , 1999, ICML.

[2]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[3]  Francisco Herrera,et al.  An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics , 2013, Inf. Sci..

[4]  Nathalie Japkowicz,et al.  Boosting Support Vector Machines for Imbalanced Data Sets , 2008, ISMIS.

[5]  Jorma Laurikkala,et al.  Improving Identification of Difficult Small Classes by Balancing Class Distribution , 2001, AIME.

[6]  Łukasz Wróbel,et al.  Application of Rule Induction Algorithms for Analysis of Data Collected by Seismic Hazard Monitoring Systems in Coal Mines , 2010 .

[7]  Md. Monirul Islam,et al.  A review on automatic image annotation techniques , 2012, Pattern Recognit..

[8]  Nello Cristianini,et al.  Controlling the Sensitivity of Support Vector Machines , 1999 .

[9]  Xindong Wu,et al.  10 Challenging Problems in Data Mining Research , 2006, Int. J. Inf. Technol. Decis. Mak..

[10]  Robert Sabourin,et al.  Adaptive ROC-based ensembles of HMMs applied to anomaly detection , 2012, Pattern Recognit..

[11]  Jing Zhang,et al.  Cost-Sensitive Large margin Distribution Machine for classification of imbalanced data , 2016, Pattern Recognit. Lett..

[12]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[13]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[14]  Jian-Jiun Ding,et al.  Facial age estimation based on label-sensitive learning and age-oriented regression , 2013, Pattern Recognit..

[15]  Ziaul Islam,et al.  Willingness-to-Pay for Community-Based Health Insurance among Informal Workers in Urban Bangladesh , 2016, PloS one.

[16]  Francisco Herrera,et al.  Ordering-based pruning for improving the performance of ensembles of classifiers in the framework of imbalanced datasets , 2016, Inf. Sci..

[17]  Yang Wang,et al.  Cost-sensitive boosting for classification of imbalanced data , 2007, Pattern Recognit..

[18]  Misha Denil,et al.  Overlap versus Imbalance , 2010, Canadian Conference on AI.

[19]  Salvatore J. Stolfo,et al.  AdaCost: Misclassification Cost-Sensitive Boosting , 1999, ICML.

[20]  Chih-Jen Lin,et al.  A Practical Guide to Support Vector Classication , 2008 .

[21]  José Martínez Sotoca,et al.  When Overlapping Unexpectedly Alters the Class Imbalance Effects , 2007, IbPRIA.

[22]  Lei Wang,et al.  AdaBoost with SVM-based component classifiers , 2008, Eng. Appl. Artif. Intell..

[23]  Francisco Herrera,et al.  SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering , 2015, Inf. Sci..

[24]  Charles Elkan,et al.  The Foundations of Cost-Sensitive Learning , 2001, IJCAI.

[25]  Yuming Zhou,et al.  A novel ensemble method for classifying imbalanced data , 2015, Pattern Recognit..

[26]  S. Dick,et al.  Applying Novel Resampling Strategies To Software Defect Prediction , 2007, NAFIPS 2007 - 2007 Annual Meeting of the North American Fuzzy Information Processing Society.

[27]  Kai Ming Ting,et al.  A Comparative Study of Cost-Sensitive Boosting Algorithms , 2000, ICML.

[28]  Francisco Herrera,et al.  Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics , 2012, Expert Syst. Appl..

[29]  Francisco Herrera,et al.  Fuzzy rough classifiers for class imbalanced multi-instance data , 2016, Pattern Recognit..

[30]  Adrião Duarte Dória Neto,et al.  Creating an ensemble of diverse support vector machines using Adaboost , 2009, 2009 International Joint Conference on Neural Networks.

[31]  Petros Xanthopoulos,et al.  A priori synthetic over-sampling methods for increasing classification sensitivity in imbalanced data sets , 2016, Expert Syst. Appl..

[32]  Regina Berretta,et al.  Heterogeneous Ensemble Combination Search Using Genetic Algorithm for Class Imbalanced Data Classification , 2016, PloS one.

[33]  Wei Wang,et al.  Improved Algorithm for Adaboost with SVM Base Classifiers , 2006, 2006 5th IEEE International Conference on Cognitive Informatics.

[34]  Hyun-Chul Kim,et al.  Constructing support vector machine ensemble , 2003, Pattern Recognit..

[35]  J. Weston,et al.  Support Vector Machine Solvers , 2007 .

[36]  Francisco Herrera,et al.  On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed , 2014, Inf. Sci..

[37]  Gustavo E. A. P. A. Batista,et al.  Class Imbalances versus Class Overlapping: An Analysis of a Learning System Behavior , 2004, MICAI.