Learning Binary Decision Trees by Argmin Differentiation

We address the problem of learning binary decision trees that partition data for a downstream task. We propose to learn the discrete parameters (i.e., tree traversals and node pruning) and the continuous parameters (i.e., tree split functions and prediction functions) simultaneously using argmin differentiation. We do so by sparsely relaxing a mixed-integer program for the discrete parameters, so that gradients can pass through the program to the continuous parameters. We derive customized algorithms to efficiently compute the forward and backward passes. As a result, our tree-learning procedure can be used as an (implicit) layer in arbitrary deep networks and optimized with arbitrary loss functions. We demonstrate that our approach produces binary trees that are competitive with existing single-tree and ensemble approaches, in both supervised and unsupervised settings. Furthermore, apart from greedy approaches (whose accuracies are not competitive), our method is faster to train than all other tree-learning baselines we compare with. The code for reproducing the results is available at https://github.com/vzantedeschi/LatentTrees.
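To make the argmin-differentiation idea concrete, below is a minimal, hypothetical sketch in PyTorch; it is not the paper's actual sparse relaxation, only an illustration of the mechanism. For the box-constrained quadratic q*(η) = argmin_{q ∈ [0,1]^d} ‖q − η‖², the argmin has the closed form clamp(η, 0, 1), which is differentiable almost everywhere, so a downstream loss can backpropagate through the "solver" to the parameters that produced η.

```python
import torch

# Hypothetical sketch of argmin differentiation (not the paper's relaxation).
# For q*(eta) = argmin_{q in [0,1]^d} ||q - eta||^2, the solution is
# q* = clamp(eta, 0, 1): a closed-form argmin that is differentiable
# almost everywhere, so gradients flow through it by ordinary autograd.

eta = torch.randn(5, requires_grad=True)   # e.g., outputs of split functions
q_star = torch.clamp(eta, 0.0, 1.0)        # argmin of the relaxed program
loss = (q_star - 0.5).pow(2).sum()         # arbitrary downstream loss
loss.backward()
print(eta.grad)                            # nonzero only where 0 < eta < 1
```

In the general case the argmin has no closed form; one instead solves the relaxed program in the forward pass and differentiates it implicitly in the backward pass, which is what the customized forward/backward algorithms mentioned in the abstract provide.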
