Alternating optimization of decision trees, with application to learning sparse oblique trees

Learning a decision tree from data is a difficult optimization problem. The most widespread algorithm in practice, dating to the 1980s, is based on greedy growth of the tree structure by recursively splitting nodes, possibly followed by pruning the final tree back. The parameters (decision function) of an internal node are approximately estimated by minimizing an impurity measure. We give an algorithm that, given an input tree (its structure and the parameter values at its nodes), produces a new tree with the same or smaller structure but new parameter values that provably lower, or leave unchanged, the misclassification error. It applies to both axis-aligned and oblique trees, and our experiments show it consistently outperforms various other algorithms while remaining highly scalable to large datasets and trees. Further, the same algorithm can handle a sparsity penalty, so it can learn sparse oblique trees, whose structure is a subset of the original tree and whose nodes have few nonzero parameters. This combines the best of axis-aligned and oblique trees: flexibility to model correlated data, low generalization error, fast inference, and interpretable nodes that involve only a few features in their decision.
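To make the alternating scheme concrete, below is a minimal Python sketch, not the authors' implementation. It grows nothing: it repeatedly revisits each node of a fixed complete binary tree of oblique splits, updating leaves to the majority class of the points reaching them, and updating each split by solving a reduced binary problem over only the points whose classification actually depends on which child they are sent to. The names (`Node`, `tao_pass`), the bottom-up update order, and the use of an L1-penalized logistic regression as a surrogate for the sparse 0/1-loss node problem are illustrative assumptions; the monotonic-decrease guarantee stated above holds when each node problem is solved exactly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


class Node:
    """One node of a complete binary tree of oblique (linear) splits."""

    def __init__(self, depth, max_depth, n_features, n_classes, rng):
        self.is_leaf = depth == max_depth
        if self.is_leaf:
            self.label = int(rng.integers(n_classes))   # leaf predicts a single class
        else:
            self.w = rng.normal(size=n_features)        # send x right iff w.x + b >= 0
            self.b = 0.0
            self.left = Node(depth + 1, max_depth, n_features, n_classes, rng)
            self.right = Node(depth + 1, max_depth, n_features, n_classes, rng)

    def predict_one(self, x):
        node = self
        while not node.is_leaf:
            node = node.right if x @ node.w + node.b >= 0 else node.left
        return node.label


def predict(root, X):
    return np.array([root.predict_one(x) for x in X])


def tao_pass(node, X, y, C=0.1):
    """Update this node with every other node fixed, using only the training
    points (X, y) that currently reach it; descendants are updated first."""
    if len(y) == 0:
        return
    if node.is_leaf:
        node.label = int(np.bincount(y).argmax())       # leaf update: majority class
        return
    goes_right = X @ node.w + node.b >= 0
    tao_pass(node.left, X[~goes_right], y[~goes_right], C)
    tao_pass(node.right, X[goes_right], y[goes_right], C)
    # Reduced problem at an internal node: for each point reaching it, would the
    # (already updated) left or right subtree classify it correctly?
    left_ok = np.array([node.left.predict_one(x) for x in X]) == y
    right_ok = np.array([node.right.predict_one(x) for x in X]) == y
    care = left_ok != right_ok                          # only these points matter here
    if care.sum() < 2 or len(np.unique(right_ok[care])) < 2:
        return
    # Surrogate for the sparse 0/1-loss node problem: L1-penalized logistic
    # regression (smaller C -> stronger sparsity -> fewer features per split).
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X[care], right_ok[care].astype(int))        # target 1 means "send right"
    node.w, node.b = clf.coef_.ravel(), float(clf.intercept_[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)       # toy correlated-feature task
    root = Node(0, 3, n_features=10, n_classes=2, rng=rng)
    for _ in range(10):                                 # a few alternating passes
        tao_pass(root, X, y)
    print("training error:", np.mean(predict(root, X) != y))
```

In this sketch, shrinking C in the surrogate drives more weights per split to exactly zero, which is how a sparsity penalty yields oblique nodes that involve only a few features; splits whose weight vector becomes all-zero, and subtrees that no longer receive any points, can then be removed, giving a tree whose structure is the same as or smaller than the input tree's.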
