An Empirical Study of Bagging Predictors for Imbalanced Data with Different Levels of Class Distribution

Research into learning from imbalanced data has increasingly captured the attention of both academia and industry, especially when the class distribution is highly skewed. This paper compares the Area Under the Receiver Operating Characteristic Curve (AUC ) performance of bagging in the context of learning from different imbalanced levels of class distribution. Despite the popularity of bagging in many real-world applications, some questions have not been clearly answered in the existing research, e.g., which bagging predictors may achieve the best performance for applications, and whether bagging is superior to single learners when the levels of class distribution change. We perform a comprehensive evaluation of the AUC performance of bagging predictors with 12 base learners at different imbalanced levels of class distribution by using a sampling technique on 14 imbalanced data-sets. Our experimental results indicate that Decision Table (DTable) and RepTree are the learning algorithms with the best bagging AUC performance. Most AUC performances of bagging predictors are statistically superior to single learners, except for Support Vector Machines (SVM) and Decision Stump (DStump).

[1]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[2]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[3]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[4]  Yang Wang,et al.  Cost-sensitive boosting for classification of imbalanced data , 2007, Pattern Recognit..

[5]  Dimitris Kanellopoulos,et al.  Handling imbalanced datasets: A review , 2006 .

[6]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[7]  Chengqi Zhang,et al.  An Empirical Study of Bagging Predictors for Different Learning Algorithms , 2011, AAAI.

[8]  Longin Jan Latecki,et al.  Improving SVM Classification on Imbalanced Data Sets in Distance Spaces , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[9]  Jesus A. Gonzalez,et al.  Machine Learning for Imbalanced Datasets: Application in Medical Diagnostic , 2006, FLAIRS.

[10]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[11]  Jacek M. Zurada,et al.  Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance , 2008, Neural Networks.

[12]  Gary M. Weiss Mining with rarity: a unifying framework , 2004, SKDD.

[13]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[14]  Damminda Alahakoon,et al.  Minority report in fraud detection: classification of skewed data , 2004, SKDD.

[15]  Hendrik Blockeel,et al.  Knowledge Discovery in Databases: PKDD 2003 , 2003, Lecture Notes in Computer Science.

[16]  Tom Fawcett,et al.  An introduction to ROC analysis , 2006, Pattern Recognit. Lett..

[17]  Chao-Ton Su,et al.  An Evaluation of the Robustness of MTS for Imbalanced Data , 2007, IEEE Transactions on Knowledge and Data Engineering.

[18]  Ian Witten,et al.  Data Mining , 2000 .

[19]  Tom Fawcett,et al.  ROC Graphs: Notes and Practical Considerations for Researchers , 2007 .

[20]  A. Buja,et al.  OBSERVATIONS ON BAGGING , 2006 .

[21]  D. Opitz,et al.  Popular Ensemble Methods: An Empirical Study , 1999, J. Artif. Intell. Res..

[22]  R. Bharat Rao,et al.  Data mining for improved cardiac care , 2006, SKDD.

[23]  M. Maloof Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown , 2003 .

[24]  Zeng-Chang Qin,et al.  ROC analysis for predictions made by probabilistic classifiers , 2005, 2005 International Conference on Machine Learning and Cybernetics.

[25]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[26]  Nitesh V. Chawla,et al.  SMOTEBoost: Improving Prediction of the Minority Class in Boosting , 2003, PKDD.