Implementation and Evaluation of Decision Trees with Range and Region Splitting

We propose an extension of an entropy-based heuristic for constructing a decision tree from a large database with many numeric attributes. Conventional methods handle numeric attributes inefficiently when some of them are strongly correlated; our approach offers one solution to this problem. For each pair of strongly correlated numeric attributes, we compute a two-dimensional association rule with respect to these attributes and the objective attribute of the decision tree. In particular, we consider a family R of grid regions in the plane associated with the pair of attributes. For R ∈ R, the data can be split into two classes: data inside R and data outside R. We compute the region R_opt ∈ R that minimizes the entropy of the splitting, and add the splitting associated with R_opt (for each pair of strongly correlated attributes) to the set of candidate tests in the entropy-based heuristic. We give efficient algorithms for the cases in which R consists of (1) x-monotone connected regions, (2) based monotone regions, (3) rectangles, and (4) rectilinear convex regions. The algorithm has been implemented as a subsystem of SONAR (System for Optimized Numeric Association Rules), developed by the authors. We have confirmed that the optimal region can be computed efficiently, and experiments on diverse data sets show that our approach creates compact trees whose accuracy is comparable with or better than that of conventional trees. More importantly, we can capture non-linear correlations among numeric attributes that could not be found without region splitting.
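To make the splitting criterion concrete, the following is a minimal Python sketch of the entropy minimization for case (3), rectangles, using brute-force search over candidate grid rectangles. It illustrates only the objective being optimized, not the paper's efficient algorithms; the function names and the grid lines `xs`, `ys` are our own illustrative choices.

```python
import math
from itertools import combinations

def entropy(pos, neg):
    """Binary entropy (in bits) of a class distribution with
    `pos` positive and `neg` negative examples."""
    n = pos + neg
    if n == 0 or pos == 0 or neg == 0:
        return 0.0  # a pure (or empty) class has zero entropy
    p = pos / n
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def split_entropy(inside, outside):
    """Weighted entropy of a two-way split into the (pos, neg)
    counts inside a region R and outside it."""
    n = sum(inside) + sum(outside)
    return sum((sum(c) / n) * entropy(*c) for c in (inside, outside) if sum(c))

def best_rectangle(points, labels, xs, ys):
    """Exhaustively search axis-aligned rectangles whose sides lie on
    the grid lines `xs` and `ys`, returning the one minimizing the
    entropy of the inside/outside split.  Brute force, O(|xs|^2 * |ys|^2 * n),
    purely for illustration; the paper gives much faster algorithms."""
    best_e, best_rect = float('inf'), None
    for x1, x2 in combinations(sorted(xs), 2):
        for y1, y2 in combinations(sorted(ys), 2):
            inside, outside = [0, 0], [0, 0]
            for (x, y), lab in zip(points, labels):
                bucket = inside if (x1 <= x <= x2 and y1 <= y <= y2) else outside
                bucket[0 if lab else 1] += 1
            e = split_entropy(tuple(inside), tuple(outside))
            if e < best_e:
                best_e, best_rect = e, (x1, y1, x2, y2)
    return best_e, best_rect
```

For example, if the positive examples cluster in one corner of the plane, the search finds a rectangle isolating them and the split entropy drops to zero, which is exactly the kind of two-dimensional structure a single-attribute guillotine cut cannot express.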

[1] J. Ross Quinlan, et al. Induction of Decision Trees, 1986, Machine Learning.

[2] Yasuhiko Morimoto, et al. Computing Optimized Rectilinear Regions for Association Rules, 1997, KDD.

[3] Tetsuo Asano, et al. Partial Construction of an Arrangement of Lines and Its Application to Optimal Partitioning of Bichromatic Point Set (Special Section on Discrete Mathematics and Its Applications), 1994.

[4] Jorma Rissanen, et al. SLIQ: A Fast Scalable Classifier for Data Mining, 1996, EDBT.

[5] P. Utgoff, et al. Multivariate Decision Trees, 1995, Machine Learning.

[6] Yasuhiko Morimoto, et al. Mining optimized association rules for numeric attributes, 1996, J. Comput. Syst. Sci..

[7] Catherine Blake, et al. UCI Repository of machine learning databases, 1998.

[8] Ronald L. Rivest, et al. Inferring Decision Trees Using the Minimum Description Length Principle, 1989, Inf. Comput..

[9] Leonidas J. Guibas, et al. Fractional Cascading: A Data Structuring Technique with Geometric Applications, 1985, ICALP.

[10] Ronald L. Rivest, et al. Constructing Optimal Binary Decision Trees is NP-Complete, 1976, Inf. Process. Lett..

[11] Tetsuo Asano, et al. Polynomial-time solutions to image segmentation, 1996, SODA '96.

[12] J. Ross Quinlan, et al. C4.5: Programs for Machine Learning, 1992.

[13] Carla E. Brodley, et al. Multivariate decision trees, 2004, Machine Learning.

[14] David S. Johnson, et al. Computers and Intractability: A Guide to the Theory of NP-Completeness, 1978.

[15] Yasuhiko Morimoto, et al. Data mining using two-dimensional optimized association rules: scheme, algorithms, and visualization, 1996, SIGMOD '96.

[16] Yasuhiko Morimoto, et al. SONAR: system for optimized numeric association rules, 1996, SIGMOD '96.

[17] David P. Dobkin, et al. Probing Convex Polytopes, 1990, Autonomous Robot Vehicles.

[18] David Eppstein, et al. Computing the discrepancy, 1993, SCG '93.

[19] David S. Johnson, et al. Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, San Francisco, 1979.

[20] David J. Spiegelhalter, et al. Machine Learning, Neural and Statistical Classification, 2009.

[21] Jorma Rissanen, et al. MDL-Based Decision Tree Pruning, 1995, KDD.

[22] Tomasz Imielinski, et al. Database Mining: A Performance Perspective, 1993, IEEE Trans. Knowl. Data Eng..

[23] Aiko M. Hormann, et al. Programs for Machine Learning. Part I, 1962, Inf. Control..

[24] Tomasz Imielinski, et al. An Interval Classifier for Database Mining Applications, 1992, VLDB.