Predicting Nearly as Well as the Best Pruning of a Decision Tree

Many algorithms for inferring a decision tree from data involve a two-phase process: First, a very large decision tree is grown, which typically ends up “over-fitting” the data. To reduce over-fitting, in the second phase, the tree is pruned using one of a number of available methods. The final tree is then output and used for classification on test data. In this paper, we suggest an alternative approach to the pruning phase. Using a given unpruned decision tree, we present a new method of making predictions on test data, and we prove that our algorithm's performance will not be “much worse” (in a precise technical sense) than the predictions made by the best reasonably small pruning of the given decision tree. Thus, our procedure is guaranteed to be competitive (in terms of the quality of its predictions) with any pruning algorithm. We prove that our procedure is very efficient and highly robust. Our method can be viewed as a synthesis of two previously studied techniques. First, we apply Cesa-Bianchi et al.'s (1993) results on predicting using “expert advice” (where we view each pruning as an “expert”) to obtain an algorithm that has provably low prediction loss, but that is computationally infeasible. Next, we generalize and apply a method developed by Buntine (1990, 1992) and Willems, Shtarkov and Tjalkens (1993, 1995) to derive a very efficient implementation of this procedure.
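To make the two ingredients concrete, below is a minimal sketch of this kind of recursive weighting scheme in its log-loss (Bayes-mixture, context-tree-weighting) flavor, rather than the absolute-loss aggregating algorithm analyzed in the paper. Each node of the given unpruned tree keeps a local Krichevsky–Trofimov label predictor and mixes, with equal prior weight, “prune here” against “defer to the two subtrees”; an online update touches only the nodes on an example's root-to-leaf path. The names (Node, update, predict_prob_one) and the choice of binary labels and binary splits are illustrative assumptions, not notation from the paper.

import copy
import math


class Node:
    """A node of the given unpruned decision tree.

    `split` maps an instance x to 'left' or 'right'; leaves have split=None.
    (Illustrative structure; binary splits and binary labels are assumed.)
    """

    def __init__(self, split=None, left=None, right=None):
        self.split = split
        self.left = left
        self.right = right
        self.counts = [0, 0]      # label counts of examples reaching this node
        self.log_p_leaf = 0.0     # log-probability of the data if we prune here
        self.log_p_mix = 0.0      # log of the weighted mixture over prunings


def kt_prob(counts, y):
    # Krichevsky-Trofimov estimate of the next label being y.
    return (counts[y] + 0.5) / (counts[0] + counts[1] + 1.0)


def log_sum_exp(a, b):
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def update(node, x, y):
    """Absorb example (x, y): update every node on x's path, bottom-up."""
    node.log_p_leaf += math.log(kt_prob(node.counts, y))
    node.counts[y] += 1
    if node.split is None:
        node.log_p_mix = node.log_p_leaf
    else:
        child = node.left if node.split(x) == 'left' else node.right
        update(child, x, y)
        # Equal-weight mixture of "prune at this node" and
        # "use the mixtures of both subtrees".
        log_children = node.left.log_p_mix + node.right.log_p_mix
        node.log_p_mix = log_sum_exp(math.log(0.5) + node.log_p_leaf,
                                     math.log(0.5) + log_children)


def predict_prob_one(root, x):
    """Mixture probability that x's label is 1, given the examples seen so far.

    For clarity this copies the tree and tries both labels; a practical
    implementation would update and roll back only the O(depth) path nodes.
    """
    logs = []
    for y in (0, 1):
        trial = copy.deepcopy(root)
        update(trial, x, y)
        logs.append(trial.log_p_mix)
    return math.exp(logs[1] - log_sum_exp(logs[0], logs[1]))

In use, one would call predict_prob_one(root, x) before each label is revealed and update(root, x, y) afterwards; the per-example cost is proportional to the depth of x's path, which is what makes it feasible to compete with the exponentially many prunings without enumerating them.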

[1] Jorma Rissanen, et al. A universal data compression system, 1983, IEEE Trans. Inf. Theory.

[2] N. Littlestone. Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm, 1987, 28th Annual Symposium on Foundations of Computer Science (SFCS 1987).

[3] Chris Carter, et al. Multiple decision trees, 1988, UAI.

[4] Vladimir Vovk. Aggregating strategies, 1990, COLT '90.

[5] Wray L. Buntine. Learning classification trees, 1992.

[6] J. Ross Quinlan. C4.5: Programs for Machine Learning, 1992.

[7] Marcelo J. Weinberger, et al. Upper bounds on the probability of sequences emitted by finite-state sources and on the redundancy of the Lempel-Ziv algorithm, 1992, IEEE Trans. Inf. Theory.

[8] Abraham Lempel, et al. A sequential algorithm for the universal coding of finite memory sources, 1992, IEEE Trans. Inf. Theory.

[9] Nicolò Cesa-Bianchi, et al. How to use expert advice, 1993, STOC.

[10] Frans M. J. Willems, et al. Context Tree Weighting: A Sequential Universal Source Coding Procedure for FSMX Sources, 1993, Proceedings of the IEEE International Symposium on Information Theory.

[11] Manfred K. Warmuth, et al. Using experts for predicting continuous outcomes, 1994, European Conference on Computational Learning Theory.

[12] Manfred K. Warmuth, et al. The Weighted Majority Algorithm, 1994, Inf. Comput.

[13] Dana Ron, et al. Learning probabilistic automata with variable memory length, 1994, COLT '94.

[14] David J. Hand, et al. Averaging Over Decision Stumps, 1994, ECML.

[15] Neri Merhav, et al. Optimal sequential probability assignment for individual sequences, 1994, IEEE Trans. Inf. Theory.

[16] Frans M. J. Willems, et al. The context-tree weighting method: basic properties, 1995, IEEE Trans. Inf. Theory.

[17] Meir Feder, et al. A universal finite memory source, 1995, IEEE Trans. Inf. Theory.

[18] Yoav Freund, et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.