Fast Decision Tree Algorithm

There is growing interest in applying well-known decision-tree learning algorithms to large amounts of data. It is essential to build a decision tree over a large dataset as quickly as possible, using as little memory as possible, without a substantial loss in accuracy. In this paper we present an improved C4.5 algorithm that uses a compression mechanism to store the training and test data in memory, together with a very fast tree-pruning algorithm. Our experiments show that the presented algorithms outperform C5.0 in speed and, in most cases, in classification accuracy, at the expense of tree size: the resulting trees are larger than those produced by C5.0. Both the data compression and the pruning algorithms can easily be parallelized to achieve further speedup.
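The abstract does not specify how the compression mechanism works. As a minimal sketch only, the Python snippet below illustrates one common approach to compact in-memory storage of training data: dictionary-encoding each categorical attribute as a column of small integer codes. All names here (`encode_column`, `outlook`, `windy`) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: dictionary-encoding categorical training data so it
# fits compactly in memory. This is NOT the paper's actual compression
# mechanism (the abstract does not describe it); it only illustrates the
# general idea of replacing attribute values with small integer codes.
import numpy as np

def encode_column(values):
    """Map a column of categorical values to compact integer codes."""
    vocab = {}                                     # value -> code
    codes = np.empty(len(values), dtype=np.uint16) # 2 bytes per cell
    for i, v in enumerate(values):
        codes[i] = vocab.setdefault(v, len(vocab))
    return codes, vocab

# Toy dataset with two categorical attributes.
outlook = ["sunny", "rain", "sunny", "overcast", "rain"]
windy   = ["true", "false", "false", "true", "true"]

outlook_codes, outlook_vocab = encode_column(outlook)
windy_codes, windy_vocab = encode_column(windy)

# Each cell now costs two bytes instead of a full string object, and the
# per-attribute code arrays suit the column-wise scans that C4.5-style
# split selection performs. Columns can also be encoded in parallel.
print(outlook_codes, outlook_vocab)
print(windy_codes, windy_vocab)
```

Because each attribute is encoded independently, this kind of layout is also straightforward to parallelize, consistent with the abstract's claim that the compression step admits further speedup.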
