A Dynamic Discretization Approach for Constructing Decision Trees with a Continuous Label

In traditional decision (classification) tree algorithms, the label is assumed to be a categorical (class) variable. When the label is continuous, two approaches based on existing decision tree algorithms can be used. The first applies a data discretization method in the preprocessing stage to convert the continuous label into a class label defined by a finite set of nonoverlapping intervals, and then applies a decision tree algorithm. The second simply applies a regression tree algorithm, using the continuous label directly. Each approach has its own drawbacks. We propose an algorithm that dynamically discretizes the continuous label at each node during tree induction. Extensive experiments show that the proposed method outperforms the preprocessing approach, the regression tree approach, and several nontree-based algorithms.
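
As a rough illustration of the first (preprocessing) approach only, the sketch below converts a continuous label into class indices over nonoverlapping intervals. It assumes equal-width binning, which is just one possible discretization method and is not stated in the abstract. The function names `equal_width_edges` and `discretize` and the sample labels are hypothetical, and the sketch does not represent the proposed dynamic algorithm.

```python
from bisect import bisect_right


def equal_width_edges(values, k):
    """Return the k-1 interior cut points splitting [min, max] into k equal-width intervals.

    Equal-width binning is an assumption made for illustration only.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def discretize(value, edges):
    """Map a continuous value to the index of its interval (0 .. len(edges)).

    A value equal to a cut point is assigned to the interval on its right.
    """
    return bisect_right(edges, value)


# Hypothetical continuous labels, discretized once before any tree is built.
labels = [1.0, 2.5, 4.0, 7.5, 9.0, 10.0]
edges = equal_width_edges(labels, 3)
classes = [discretize(y, edges) for y in labels]

print(edges)    # interior cut points
print(classes)  # class index per label, usable by a classification tree
```

Because the cut points are fixed before induction, every node of the tree sees the same class definition. The proposed method differs in that it re-discretizes the label at each node.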
