A Robust Classifier for Imbalanced Datasets

Classifying imbalanced datasets is challenging because many classifiers are sensitive to the class distribution, biasing their predictions toward the majority class. The Hellinger distance has been shown to be skew-insensitive, and decision trees that use it as a splitting criterion outperform decision trees based on information gain. We propose HeDEx, a new decision tree ensemble classifier based on the Hellinger distance that builds randomized trees by selecting both the split attribute and the split point at random. We also propose hyperplane decision surfaces for HeDEx to further improve performance. In addition, we introduce a pattern-based oversampling method that reduces the bias toward the majority class: patterns are detected from HeDEx models, and the newly generated instances are admitted only after a verification step using Hellinger Distance Decision Trees. Our experiments show that the proposed methods improve performance on imbalanced datasets over the state-of-the-art Hellinger Distance Decision Trees.
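
To make the splitting criterion concrete: in Hellinger-distance-based tree induction, a candidate split is scored by the Hellinger distance between the per-branch distributions of the positive and negative classes. Because the score depends only on within-class rates, not on class priors, it is insensitive to skew. The following is a minimal sketch in Python of that criterion, under our own naming and toy counts (it is an illustration of the standard formula, not code from the paper):

    import math

    def hellinger_split_value(pos_counts, neg_counts):
        """Hellinger distance between the per-branch distributions of the
        positive and negative classes for a candidate split.

        pos_counts[j] / neg_counts[j] are the numbers of positive / negative
        training instances routed to branch j by the split.
        """
        total_pos = sum(pos_counts)
        total_neg = sum(neg_counts)
        acc = 0.0
        for p, n in zip(pos_counts, neg_counts):
            # Compare the square roots of the class-conditional branch rates;
            # class priors cancel out, which yields skew-insensitivity.
            acc += (math.sqrt(p / total_pos) - math.sqrt(n / total_neg)) ** 2
        return math.sqrt(acc)

    # A split isolating most of the minority class scores high, while a split
    # that mixes both classes in equal proportions scores zero.
    print(hellinger_split_value([9, 1], [30, 960]))   # ~1.02 (strong split)
    print(hellinger_split_value([5, 5], [495, 495]))  # 0.0 (uninformative)

In a randomized ensemble such as HeDEx, a score like this would be evaluated on randomly drawn attribute/split-point candidates rather than on an exhaustive search over all thresholds.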
