Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data

Nowadays, deep neural networks (DNNs) have become the main instrument for machine learning tasks within a wide range of domains, including vision, NLP, and speech. Meanwhile, in the important case of heterogeneous tabular data, the advantage of DNNs over their shallow counterparts remains questionable. In particular, there is insufficient evidence that deep learning machinery allows constructing methods that outperform gradient boosting decision trees (GBDT), which are often the top choice for tabular problems. In this paper, we introduce Neural Oblivious Decision Ensembles (NODE), a new deep learning architecture designed to work with any tabular data. In a nutshell, the proposed NODE architecture generalizes ensembles of oblivious decision trees, but benefits from both end-to-end gradient-based optimization and the power of multi-layer hierarchical representation learning. Through an extensive experimental comparison with the leading GBDT packages on a large number of tabular datasets, we demonstrate the advantage of the proposed NODE architecture, which outperforms the competitors on most of the tasks. We open-source the PyTorch implementation of NODE and believe that it will become a universal framework for machine learning on tabular data.
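
As a concrete illustration of the core idea, below is a minimal PyTorch sketch of a single differentiable oblivious decision tree, the building block that NODE ensembles and stacks. This is a simplified sketch, not the paper's implementation: the class name SoftObliviousTree and all hyperparameters are hypothetical, and plain softmax feature selection with sigmoid splits is used where the actual architecture relies on sparse (entmax-style) choice functions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftObliviousTree(nn.Module):
    """A minimal differentiable oblivious decision tree of a given depth.

    Each of the `depth` levels applies the *same* soft split to every example:
    a learnable feature-selection weight vector, a learnable threshold, and a
    sigmoid "soft" binary decision. The 2**depth leaf responses are then mixed
    with the product of the per-level branch probabilities.
    """

    def __init__(self, in_features: int, depth: int = 4, out_features: int = 1):
        super().__init__()
        self.depth = depth
        # One feature-selection logit vector, threshold, and temperature per level.
        self.feature_logits = nn.Parameter(torch.zeros(depth, in_features))
        self.thresholds = nn.Parameter(torch.zeros(depth))
        self.log_temperatures = nn.Parameter(torch.zeros(depth))
        # One response vector per leaf (2**depth leaves).
        self.leaf_responses = nn.Parameter(torch.randn(2 ** depth, out_features) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Soft feature selection per level (softmax here; the paper uses entmax).
        feature_weights = F.softmax(self.feature_logits, dim=-1)   # (depth, in_features)
        chosen = x @ feature_weights.t()                           # (batch, depth)
        # Soft binary decision at every level (sigmoid here; the paper uses entmoid).
        logits = (chosen - self.thresholds) * torch.exp(-self.log_temperatures)
        p_right = torch.sigmoid(logits)                            # (batch, depth)
        p = torch.stack([1.0 - p_right, p_right], dim=-1)          # (batch, depth, 2)
        # Probability of reaching each of the 2**depth leaves is the product
        # of per-level branch probabilities, built up by repeated outer products.
        leaf_probs = p[:, 0, :]
        for d in range(1, self.depth):
            leaf_probs = (leaf_probs.unsqueeze(-1) * p[:, d, :].unsqueeze(1)).flatten(1)
        # Weighted sum of leaf responses gives the tree output.
        return leaf_probs @ self.leaf_responses                    # (batch, out_features)

# Example usage (shapes only):
# tree = SoftObliviousTree(in_features=10, depth=4, out_features=1)
# pred = tree(torch.randn(32, 10))  # -> (32, 1)
```

In the full NODE architecture, many such trees form a single layer, and several layers are stacked into a deep model trained end-to-end with standard backpropagation, which is what provides the multi-layer hierarchical representation learning described above.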
