Pruning Decision Trees and Lists

Machine learning algorithms are techniques that automatically build models describing the structure at the heart of a set of data. Ideally, such models can be used to predict properties of future data points and people can use them to analyze the domain from which the data originates. Decision trees and lists are potentially powerful predictors and embody an explicit representation of the structure in a dataset. Their accuracy and comprehensibility depend on how concisely the learning algorithm can summarize this structure. The final model should not incorporate spurious effects—patterns that are not genuine features of the underlying domain. Given an efficient mechanism for determining when a particular effect is due to chance alone, non-predictive parts of a model can be eliminated or “pruned.” Pruning mechanisms require a sensitive instrument that uses the data to detect whether there is a genuine relationship between the components of a model and the domain. Statistical significance tests are theoretically well-founded tools for doing exactly that. This thesis presents pruning algorithms for decision trees and lists that are based on significance tests. We explain why pruning is often necessary to obtain small and accurate models and show that the performance of standard pruning algorithms can be improved by taking the statistical significance of observations into account. We compare the effect of parametric and non-parametric tests, analyze why current pruning algorithms for decision lists often prune too aggressively, and review related work—in particular existing approaches that use significance tests in the context of pruning. The main outcome of this investigation is a set of simple pruning algorithms that should prove useful in practical data mining applications.
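
To make the central idea concrete, the sketch below shows one way a significance test can drive a pruning decision: build the contingency table of child node versus class for a candidate split and keep the split only if the association is statistically significant, judged either by a parametric chi-squared test or by a non-parametric permutation test. This is a minimal illustrative sketch, not the thesis's actual procedure; the function names, the 0.05 level, and the permutation count are assumptions made for the example.

```python
import numpy as np
from scipy.stats import chi2_contingency

def contingency(child_ids, labels):
    """Class-count table for a candidate split: rows are the children the
    split sends instances to, columns are the class labels (both 0..k-1)."""
    child_ids, labels = np.asarray(child_ids), np.asarray(labels)
    table = np.zeros((child_ids.max() + 1, labels.max() + 1), dtype=int)
    np.add.at(table, (child_ids, labels), 1)
    return table

def chi2_of(table):
    """Parametric test: chi-squared statistic and p-value for independence
    between split membership and class."""
    stat, p, _, _ = chi2_contingency(table)
    return stat, p

def permutation_p(child_ids, labels, n_perm=1000, seed=0):
    """Non-parametric alternative: shuffle the class labels repeatedly and
    count how often chance alone yields an association at least as strong."""
    rng = np.random.default_rng(seed)
    observed, _ = chi2_of(contingency(child_ids, labels))
    labels = np.asarray(labels)
    hits = sum(chi2_of(contingency(child_ids, rng.permutation(labels)))[0] >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

def should_prune(child_ids, labels, alpha=0.05, parametric=True):
    """Prune the split when the split/class association is not significant,
    i.e. when the observed pattern could plausibly be due to chance alone."""
    if parametric:
        _, p = chi2_of(contingency(child_ids, labels))
    else:
        p = permutation_p(child_ids, labels)
    return p > alpha

# Toy usage: instances 0-3 reach child 0, instances 4-7 reach child 1.
child_ids  = [0, 0, 0, 0, 1, 1, 1, 1]
pure_split = [0, 0, 0, 0, 1, 1, 1, 1]   # split separates the two classes
no_pattern = [0, 1, 0, 1, 0, 1, 0, 1]   # children mirror the parent distribution
print(should_prune(child_ids, pure_split))   # False: keep the split
print(should_prune(child_ids, no_pattern))   # True: prune it
```

The permutation variant avoids relying on the chi-squared distribution's asymptotic approximation, which is questionable for the small counts typical near the leaves of a tree, at the cost of extra computation; that trade-off is one aspect of the parametric versus non-parametric comparison mentioned above.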
