Learning Binary Trees via Sparse Relaxation

One of the most classical problems in machine learning is learning binary trees that split data into useful partitions. From classification and regression via decision trees to hierarchical clustering, binary trees are useful because they (a) are often easy to visualize; (b) make computationally efficient predictions; and (c) allow for flexible partitioning. Because of this, there has been extensive research on learning such trees, which generally falls into one of three categories: 1. greedy node-by-node optimization; 2. probabilistic relaxations for differentiability; 3. mixed-integer programming (MIP). Each of these has downsides: greedy methods can myopically choose poor splits, probabilistic relaxations lack principled ways to prune trees, and MIP methods can be slow on large problems and may not generalize. In this work we derive a novel sparse relaxation for binary tree learning. By formulating a new MIP and sparsely relaxing it, our approach learns both tree splits and tree pruning using argmin differentiation. We demonstrate that our approach yields easily visualized trees and is competitive with current tree-based approaches on classification, regression, and hierarchical clustering. Source code is available at this http URL.
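
To make the flavor of a sparse, differentiable relaxation concrete, the sketch below is a simplification under our own assumptions, not the paper's actual formulation: it replaces the hard left/right decision at each tree node with a sparsemax projection onto the 2-simplex, whose output can contain exact zeros. Leaf-arrival weights are then products of these sparse routing weights along each root-to-leaf path, so whole subtrees can receive exactly zero weight, acting as an implicit form of pruning. The helper names (sparsemax, leaf_weights) and the fixed depth-2 tree are illustrative only.

import numpy as np

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex.
    # Unlike softmax, the result can contain exact zeros, i.e. it is sparse.
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum            # coordinates kept in the support
    k_star = k[support][-1]
    tau = (cumsum[k_star - 1] - 1.0) / k_star      # threshold for the projection
    return np.maximum(z - tau, 0.0)

def leaf_weights(x, W, b):
    # Route x through a depth-2 tree: 3 internal nodes (root, left child, right
    # child) and 4 leaves. Each node's hard decision 1[w.x + b > 0] is relaxed
    # to sparsemax over (score, -score); leaf weights are products of the
    # routing weights along each path, so many are exactly zero.
    probs = []
    for w_i, b_i in zip(W, b):
        s = float(np.dot(w_i, x) + b_i)
        p_left, p_right = sparsemax(np.array([s, -s]))
        probs.append((p_left, p_right))
    (l0, r0), (l1, r1), (l2, r2) = probs
    return np.array([l0 * l1, l0 * r1, r0 * l2, r0 * r2])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=5)
    W = rng.normal(size=(3, 5))       # one weight vector per internal node
    b = rng.normal(size=3)
    print(leaf_weights(x, W, b))      # sums to 1; often contains exact zeros

Because sparsemax is itself the argmin of a quadratic program over the simplex, it is differentiable almost everywhere, so the node parameters W and b could in principle be trained end-to-end with gradient-based methods. This is only meant to illustrate the general idea of argmin differentiation with sparse routing, not the specific MIP relaxation derived in the paper.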
