DECISION TREES DO NOT GENERALIZE TO NEW VARIATIONS

The family of decision tree learning algorithms is among the most widespread and studied. Motivated by the desire to develop learning algorithms that can generalize when learning highly varying functions, such as those presumably needed to achieve artificial intelligence, we study some theoretical limitations of decision trees. We demonstrate formally that they can be seriously hurt by the curse of dimensionality, in a sense somewhat different from other nonparametric statistical methods, and, most importantly, that they cannot generalize to variations not seen in the training set. The reason is that a decision tree partitions the input space and needs at least one training example in each region associated with a leaf in order to make a sensible prediction in that region. A better understanding of the fundamental causes of this limitation suggests using forests, or even deeper architectures, instead of single trees; these provide a form of distributed representation and can generalize to variations not encountered in the training data.
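
To make the leaf-region argument concrete, the following is a minimal sketch, not taken from the paper, assuming scikit-learn and NumPy are available. The parity function over d binary inputs stands in for a "highly varying" target, and every specific choice (d = 10, a 256-example training split, the random seeds) is an illustrative assumption. A fully grown tree is fit on a random subset of the 2^d input configurations and evaluated on the remaining, unseen configurations; its held-out accuracy is typically near chance, because each leaf's prediction is determined solely by the training examples that happen to fall in that region.

import numpy as np
from itertools import product
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

d = 10                                            # number of binary input variables (assumed)
X = np.array(list(product([0, 1], repeat=d)))     # all 2^d input configurations
y = X.sum(axis=1) % 2                             # parity: a stand-in for a highly varying target

# Train on a random subset; the remaining configurations are "new variations"
# that fall in leaf regions containing no, or unrepresentative, training data.
idx = rng.permutation(len(X))
train, test = idx[:256], idx[256:]

tree = DecisionTreeClassifier(random_state=0).fit(X[train], y[train])

print("accuracy on training configurations:", tree.score(X[train], y[train]))
print("accuracy on unseen configurations:  ", tree.score(X[test], y[test]))
# The held-out accuracy is typically close to 0.5: predictions in regions not
# populated by training examples carry no information about the target.

The remedy suggested in the abstract, forests or deeper architectures, replaces the single partition with a form of distributed representation, so that a test point is characterized by the joint response of many components rather than by membership in a single region.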
