Overpruning Large Decision Trees

This paper presents empirical evidence for five hypotheses about learning from large noisy domains: that trees built from very large training sets are both larger and more accurate than trees built from even large subsets; that this increased accuracy is only in part due to the extra size of the trees; and that the extra training instances allow both better choices of attribute while building the tree and better choices of which subtrees to prune after it has been built. For the practitioner with the common goals of maximising the accuracy and minimising the size of induced trees, these conclusions prompt new techniques for induction on large training sets. Although building huge trees from huge training sets is computationally expensive, pruning smaller trees against them is not, yet it still improves accuracy. Where a pruned tree is considered too large for human or machine limitations, it can be overpruned to an acceptable size. Although this requires far more time than building a tree of that size from a correspondingly small training set, it will usually be more accurate. The paper also describes an algorithm for overpruning trees to user-specified size limits; it is evaluated in the course of testing the above hypotheses.
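To make the notion of pruning to a user-specified size limit concrete, the following is a minimal sketch of one plausible greedy, reduced-error-style approach: it repeatedly collapses whichever subtree can be replaced by a majority-class leaf at the smallest cost in errors on a held-out pruning set, until the tree fits a leaf budget. The dict-based tree representation, the stored "majority" label at internal nodes, and the function names are illustrative assumptions, not the algorithm described in the paper.

```python
# Illustrative sketch of size-limited overpruning (assumed representation,
# not the paper's algorithm). A tree node is either a leaf {"label": c} or an
# internal node {"attr": a, "majority": c, "children": {value: subtree}};
# instances are (attribute-value dict, class label) pairs.
from copy import deepcopy

def count_leaves(node):
    if "label" in node:
        return 1
    return sum(count_leaves(child) for child in node["children"].values())

def classify(node, instance):
    if "label" in node:
        return node["label"]
    child = node["children"].get(instance.get(node["attr"]))
    # Unseen attribute value: fall back to the node's stored majority class.
    return node["majority"] if child is None else classify(child, instance)

def errors(node, prune_set):
    return sum(1 for x, y in prune_set if classify(node, x) != y)

def internal_nodes(node, path=()):
    """Yield (path, node) for every internal node, deepest first."""
    if "label" in node:
        return
    for value, child in node["children"].items():
        yield from internal_nodes(child, path + (value,))
    yield path, node

def collapse(tree, path):
    """Return a copy of the tree with the node at `path` turned into a leaf."""
    new_tree = deepcopy(tree)
    node = new_tree
    for value in path:
        node = node["children"][value]
    label = node["majority"]
    node.clear()
    node["label"] = label
    return new_tree

def overprune(tree, prune_set, max_leaves):
    """Greedily collapse subtrees until at most max_leaves leaves remain,
    each step choosing the collapse that adds the fewest errors on prune_set."""
    tree = deepcopy(tree)
    while count_leaves(tree) > max_leaves:
        candidates = [collapse(tree, path) for path, _ in internal_nodes(tree)]
        tree = min(candidates,
                   key=lambda t: (errors(t, prune_set), count_leaves(t)))
    return tree
```

The key design choice in this sketch is that the expensive step is only the evaluation of candidate collapses against the pruning data, never the regrowing of a tree, which is consistent with the abstract's point that pruning an existing tree on a large training set is cheap relative to building a huge tree from scratch.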