SkewBoost: An algorithm for classifying imbalanced datasets

Many real-world datasets have an imbalanced class distribution. A classifier learned from such data tends to be biased towards the majority class and therefore to misclassify minority class samples. In this paper, we present SkewBoost, a technique that classifies minority instances correctly without compromising much on the correct classification of majority instances. In SkewBoost, minority and majority instances are identified during execution of the boosting algorithm. A variation of SMOTE is used to create synthetic minority instances, which are then added to the training set, and the total weight is rebalanced. After each iteration of the boosting algorithm, the weight of each instance is modified, using a cost-sensitive approach, to focus more on the misclassified instances. The method is evaluated on imbalanced datasets in terms of F-measure, G-mean, AUC, recall, and precision, and compared against previously published results of algorithms for imbalanced datasets.
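The two ingredients named in the abstract, synthetic minority generation and cost-sensitive reweighting, can be illustrated with a minimal sketch. This is not the SkewBoost procedure itself, since the abstract does not give its update rules. The code assumes plain SMOTE-style interpolation between a minority point and one of its nearest minority neighbours, and an AdaBoost-like exponential weight update. The function names `smote_like` and `reweight`, and the parameters `k`, `alpha`, and `minority_cost`, are hypothetical and chosen only for illustration.

```python
import math
import random


def smote_like(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (plain SMOTE idea).
    Requires at least two minority points."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` among the other minority points.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(p, base))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def reweight(weights, y_true, y_pred, alpha, minority_label=1, minority_cost=2.0):
    """Boosting-style update: raise the weight of misclassified instances,
    with a larger (assumed) cost for minority errors, then renormalise."""
    updated = []
    for w, t, p in zip(weights, y_true, y_pred):
        if t != p:
            cost = minority_cost if t == minority_label else 1.0
            w *= math.exp(alpha * cost)
        updated.append(w)
    total = sum(updated)
    return [w / total for w in updated]


# Toy usage: three minority points in 2-D, four instances to reweight.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
synthetic = smote_like(minority, n_new=4, k=2)
weights = reweight([0.25] * 4, y_true=[1, 0, 0, 0], y_pred=[0, 0, 1, 0], alpha=0.5)
```

In this toy run the misclassified minority instance ends up with the largest weight, the misclassified majority instance with the next largest, and the correctly classified instances share the smallest weight, which is the general behaviour a cost-sensitive boosting update is meant to produce.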
