C4.5 and Imbalanced Data Sets: Investigating the effect of sampling method, probabilistic estimate, and decision tree structure

Imbalanced data sets are becoming ubiquitous, as many applications have very few instances of the "interesting" or "abnormal" class. Traditional machine learning algorithms can be biased towards the majority class because of its over-prevalence. It is desirable to improve prediction of the interesting (minority) class, even at the cost of additional majority-class errors. In this paper, we study three issues, usually considered separately, concerning decision trees and imbalanced data sets: the quality of probabilistic estimates, pruning, and the effect of preprocessing the imbalanced data set with over- or undersampling methods so that a fairly balanced training set is provided to the decision trees. We consider each issue independently and in conjunction with the others, highlighting the scenarios where one method might be preferred over another for learning decision trees from imbalanced data sets.
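
As a rough illustration (not the paper's exact experimental protocol), the sketch below rebalances a synthetic imbalanced training set by random over- or undersampling, fits a scikit-learn decision tree as a stand-in for C4.5, and compares ROC AUC on a held-out set. The synthetic data set, the `rebalance` helper, and the use of `ccp_alpha` as the pruning knob are all illustrative assumptions; the paper's Laplace-smoothed leaf estimates are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def rebalance(X, y, method):
    """Return a class-balanced copy of (X, y) via random sampling (hypothetical helper)."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    if method == "oversample":
        # keep everything, then duplicate random minority instances until balanced
        extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
        idx = np.concatenate([majority, minority, extra])
    else:
        # discard random majority instances until balanced
        keep = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([keep, minority])
    return X[idx], y[idx]

# Synthetic data set with roughly a 5% minority class
X, y = make_classification(n_samples=4000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for method in ("oversample", "undersample"):
    Xb, yb = rebalance(X_tr, y_tr, method)
    # ccp_alpha=0.0 leaves the tree unpruned; increasing it prunes the tree
    tree = DecisionTreeClassifier(ccp_alpha=0.0, random_state=0).fit(Xb, yb)
    scores = tree.predict_proba(X_te)[:, 1]  # leaf-frequency probability estimates
    print(f"{method:11s} AUC = {roc_auc_score(y_te, scores):.3f}")
```

AUC is used here, rather than accuracy, because a classifier can reach high accuracy on a 95/5 split by always predicting the majority class; ranking quality over both classes is the more informative measure in the imbalanced setting.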
