Minority split and gain ratio for a class imbalance

A decision tree is one of most popular classifiers that classifies a balanced data set effectively. For an imbalanced data set, a standard decision tree tends to misclassify instances of a class having tiny number of samples. In this paper, we modify the decision tree induction algorithm by performing a ternary split on continuous-valued attributes focusing on distribution of minority class instances. The algorithm uses the minority variance to rank candidates of the high gain ratio, then it chooses the candidate with the minimum minority entropy. From our experiments with data sets from UCI and Statlog repository, this method achieves the better performance comparing with C4.5 using only gain ratio for imbalanced data sets.

[1]  Krung Sinapiromsaran,et al.  SMOUTE:Synthetics Minority Over-sampling and Under-sampling TEchniques for class imbalanced problem , 2010 .

[2]  David J. Spiegelhalter,et al.  Machine Learning, Neural and Statistical Classification , 2009 .

[3]  J. R. Quinlan Discovering rules by induction from large collections of examples Intro-ductory readings in expert s , 1979 .

[4]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[5]  Thanh-Nghi Do,et al.  A Comparison of Different Off-Centered Entropies to Deal with Class Imbalance for Decision Trees , 2008, PAKDD.

[6]  Aiko M. Hormann,et al.  Programs for Machine Learning. Part I , 1962, Inf. Control..

[7]  J. Ross Quinlan,et al.  Induction of Decision Trees , 1986, Machine Learning.

[8]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[9]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[10]  J. Ross Quinlan,et al.  Learning Efficient Classification Procedures and Their Application to Chess End Games , 1983 .

[11]  Chumphol Bunkhumpornpat,et al.  Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem , 2009, PAKDD.

[12]  Gilbert Ritschard,et al.  An asymmetric entropy measure for decision trees , 2006 .

[13]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[14]  Fredric C. Gey,et al.  The relationship between recall and precision , 1994 .

[15]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[16]  Herna L. Viktor,et al.  Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach , 2004, SKDD.

[17]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[18]  Nitesh V. Chawla,et al.  SMOTEBoost: Improving Prediction of the Minority Class in Boosting , 2003, PKDD.