Comparisons of ADABOOST, KNN, SVM and Logistic Regression in Classification of Imbalanced Dataset

Data mining classification techniques are affected by imbalance between the classes of a response variable. The difficulty of handling imbalanced data has led to an influx of methods that resolve the imbalance at either the data level or the algorithmic level. The R programming language is one of the many tools available for data mining. This paper compares several classification algorithms in R on an imbalanced medical data set. The classifiers ADABOOST, KNN, SVM-RBF and logistic regression were applied to the original data set and to randomly oversampled and randomly undersampled versions of it. Results show that ADABOOST, KNN and SVM-RBF exhibit overfitting when applied to the original data set. No overfitting occurs for the random oversampling data set, where SVM-RBF has the highest accuracy (training: 91.5%, testing: 90.6%), sensitivity (training: 91.0%, testing: 91.0%), specificity (training: 92.0%, testing: 90.2%) and precision (training: 91.9%, testing: 90.5%) on both the training and testing data sets. For random undersampling, only ADABOOST and logistic regression show no overfitting. Logistic regression is the most stable classifier, exhibiting consistent training and testing results.
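The two data-level remedies and the four reported measures can be illustrated with a minimal sketch. It is written in Python using only the standard library and is not the authors' R code; the function names (`random_oversample`, `random_undersample`, `binary_metrics`) and the toy data are assumptions made for illustration. It assumes oversampling duplicates minority-class rows until every class matches the largest class, and undersampling discards rows until every class matches the smallest.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(members) for members in groups.values())
    out_rows, out_labels = [], []
    for label, members in groups.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(members) for members in groups.values())
    out_rows, out_labels = [], []
    for label, members in groups.items():
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity, specificity and precision from predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "precision": _ratio(tp, tp + fp),
    }


# Hypothetical toy data: 2 minority rows (label 1) and 8 majority rows (label 0).
toy_rows = list(range(10))
toy_labels = [1, 1] + [0] * 8

over_rows, over_labels = random_oversample(toy_rows, toy_labels)
under_rows, under_labels = random_undersample(toy_rows, toy_labels)
print(Counter(over_labels), Counter(under_labels))
print(binary_metrics([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))
```

In a comparison of this kind, resampling would normally be applied to the training portion only, and the same measures would be computed on both the training and testing portions; a large gap between the two is the usual sign of overfitting.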
