C4.5 and Imbalanced Data Sets: Investigating the effect of sampling method, probabilistic estimate, and decision tree structure

Imbalanced data sets are becoming ubiquitous, as many applications have very few instances of the "interesting" or "abnormal" class. Traditional machine learning algorithms can be biased towards the majority class because of its over-prevalence. It is desirable to improve prediction of the interesting (minority) class, even at the cost of additional majority-class errors. In this paper, we study three issues, usually considered separately, concerning decision trees and imbalanced data sets: the quality of probabilistic estimates, pruning, and the effect of preprocessing the imbalanced data set with over- or undersampling methods so that a fairly balanced training set is provided to the decision trees. We consider each issue independently and in conjunction with the others, highlighting the scenarios where one method might be preferred over another for learning decision trees from imbalanced data sets.
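
As a rough illustration (not the paper's exact experimental protocol), the sketch below rebalances a synthetic imbalanced training set by random over- or undersampling, fits a scikit-learn decision tree as a stand-in for C4.5, and compares ROC AUC on a held-out set. The synthetic data set, the `rebalance` helper, and the use of `ccp_alpha` as the pruning knob are all illustrative assumptions; the paper's Laplace-smoothed leaf estimates are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def rebalance(X, y, method):
    """Return a class-balanced copy of (X, y) via random sampling (hypothetical helper)."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    if method == "oversample":
        # keep everything, then duplicate random minority instances until balanced
        extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
        idx = np.concatenate([majority, minority, extra])
    else:
        # discard random majority instances until balanced
        keep = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([keep, minority])
    return X[idx], y[idx]

# Synthetic data set with roughly a 5% minority class
X, y = make_classification(n_samples=4000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for method in ("oversample", "undersample"):
    Xb, yb = rebalance(X_tr, y_tr, method)
    # ccp_alpha=0.0 leaves the tree unpruned; increasing it prunes the tree
    tree = DecisionTreeClassifier(ccp_alpha=0.0, random_state=0).fit(Xb, yb)
    scores = tree.predict_proba(X_te)[:, 1]  # leaf-frequency probability estimates
    print(f"{method:11s} AUC = {roc_auc_score(y_te, scores):.3f}")
```

AUC is used here, rather than accuracy, because a classifier can reach high accuracy on a 95/5 split by always predicting the majority class; ranking quality over both classes is the more informative measure in the imbalanced setting.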
