When Costs Are Unequal and Unknown: A Subtree Grafting Approach for Unbalanced Data Classification

In binary classification, a decision tree learned from unbalanced data typically suffers from a high misclassification rate for the minority class. Assigning different misclassification costs to the two classes can address this problem, though usually at the expense of accuracy for the majority class. This trade-off can be particularly hazardous if the costs cannot be specified precisely. When the costs are unknown or difficult to determine, decision makers may prefer a classifier with more balanced accuracy across both classes over either a standard tree or a cost-sensitive one. This research therefore proposes a new tree induction approach called subtree grafting (STG). We test the proposed method on a real bank data set and several other data sets and find that it provides a successful compromise between standard and cost-sensitive trees.
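The trade-off between minority-class and majority-class accuracy can be illustrated with a minimal sketch. It is not the STG algorithm, which the abstract does not specify. It only assumes the standard cost-sensitive decision rule for two classes: predict the minority class when its estimated probability is at least `cost_fp / (cost_fp + cost_fn)`, with zero cost for correct predictions. The leaf probabilities, labels, cost ratios, and function names below are hypothetical and chosen only for illustration.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability threshold at or above which predicting the minority
    (positive) class minimizes expected cost, assuming correct
    predictions cost nothing."""
    return cost_fp / (cost_fp + cost_fn)


def per_class_accuracy(probs, labels, threshold):
    """Return (minority_accuracy, majority_accuracy) at a threshold.

    probs  -- estimated minority-class probabilities (e.g. tree leaf estimates)
    labels -- true classes, 1 = minority, 0 = majority
    """
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg


# Hypothetical toy data: 10 majority cases followed by 4 minority cases.
probs = [0.05, 0.05, 0.10, 0.10, 0.10, 0.20, 0.20, 0.30, 0.30, 0.60,
         0.15, 0.30, 0.60, 0.60]
labels = [0] * 10 + [1] * 4

# Hypothetical cost settings (cost_fp, cost_fn): equal costs, then
# increasingly expensive minority-class errors.
for cost_fp, cost_fn in [(1, 1), (1, 3), (1, 7)]:
    t = cost_threshold(cost_fp, cost_fn)
    minority_acc, majority_acc = per_class_accuracy(probs, labels, t)
    print(f"costs {cost_fp}:{cost_fn}  threshold {t:.3f}  "
          f"minority {minority_acc:.2f}  majority {majority_acc:.2f}")
```

On this toy data, raising the cost of minority-class errors lowers the threshold, so minority accuracy rises while majority accuracy falls. When the true cost ratio is unknown, none of these operating points can be justified on cost grounds alone, which is the situation the abstract targets.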
