Comparing SVM ensembles for imbalanced datasets

Real-life datasets often suffer from class imbalance, which hampers supervised learning. In such datasets, examples of the positive (minority) class are far fewer than those of the negative (majority) class. Constructing high-quality classifiers from such imbalanced training sets is one of the major challenges in machine learning, since traditional classification algorithms tend to be biased towards the majority class. In this paper, we compare three ensemble-based approaches for handling imbalanced datasets. All three approaches aim to overcome the under-representation of the minority class by learning from every minority class sample and a subset of the majority class samples. The three approaches, namely PARTEN, UMjC and LFM, were evaluated on several public datasets. Precision, recall, F-measure, g-mean and ROC space measures were used for comparison. A detailed discussion of the results is presented in the paper. Subsequently, we present an astronomy application, where the three methods are compared for prediction of class II, IIn and IIp supernovae.
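The comparison measures named above can all be derived from a binary confusion matrix. The following is a minimal sketch, not the evaluation code used in the paper. It assumes the common definitions: F-measure as the harmonic mean of precision and recall, g-mean as the geometric mean of the true positive and true negative rates, and a ROC space point given by (false positive rate, true positive rate). The function name `imbalance_metrics` and the example counts are hypothetical.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute common imbalance-aware measures from confusion-matrix counts.

    The positive class is taken to be the minority class.
    Definitions are the commonly used ones and are an assumption here;
    the paper may define g-mean or F-measure differently.
    """
    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    fpr = fp / (fp + tn) if (fp + tn) else 0.0           # false positive rate

    # F-measure: harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0

    # g-mean: geometric mean of accuracy on each class.
    g_mean = math.sqrt(recall * specificity)

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "g_mean": g_mean,
        "roc_point": (fpr, recall),  # (x, y) position in ROC space
    }


if __name__ == "__main__":
    # Hypothetical counts: 10 minority and 100 majority test examples.
    for name, value in imbalance_metrics(tp=8, fp=4, tn=96, fn=2).items():
        print(name, value)
```

Plain accuracy is not included in the sketch because, on a heavily imbalanced test set, a classifier that always predicts the majority class scores high accuracy while its recall and g-mean are zero.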
