Inducing Decision Trees based on a Cluster Quality Index

Decision trees are popular classifiers in data mining, artificial intelligence, and pattern recognition because they are accurate and easy to interpret. In this paper, we introduce a new procedure for inducing decision trees that yields trees that are more accurate, more compact, and more balanced. Each candidate split is evaluated using the Rand statistic, a cluster quality index based on external measures, which many authors regard as the best existing index of its kind. We compared our method with other state-of-the-art methods over 30 databases from the UCI Repository, and the results support our claims. We also introduce a new equation for measuring the balance of a binary tree.
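As a minimal sketch of the split-evaluation idea, the Rand statistic compares two partitions of the same items by counting pairwise agreements. When scoring a candidate split, one partition is induced by the split (which child each instance falls into) and the other is given by the true class labels. The example data below are illustrative assumptions, not taken from the paper:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand statistic between two partitions of the same items.

    Counts the fraction of item pairs on which the two partitions
    agree: either both place the pair in the same group, or both
    place it in different groups.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Hypothetical candidate split on six instances: 0 = left child,
# 1 = right child. The closer the induced partition matches the
# class labels, the higher the score (maximum 1.0).
classes = [0, 0, 0, 1, 1, 1]
split   = [0, 0, 1, 1, 1, 1]
score = rand_index(split, classes)  # 10 of 15 pairs agree
```

In a tree inducer, this score would be computed for every candidate split at a node and the split maximizing it selected, analogous to how information gain or the Gini index is used in classical algorithms.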
