TabNet: Attentive Interpretable Tabular Learning

We propose TabNet, a novel, high-performance, and interpretable canonical deep learning architecture for tabular data. TabNet uses sequential attention to choose which features to reason from at each decision step, enabling interpretability and more efficient learning because model capacity is spent on the most salient features. We demonstrate that TabNet outperforms other neural network and decision tree variants on a wide range of non-performance-saturated tabular datasets, and that it yields interpretable feature attributions as well as insight into the model's global behavior. Finally, for the first time to our knowledge, we demonstrate self-supervised learning for tabular data, showing that unsupervised representation learning significantly improves performance when unlabeled data is abundant.
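
The sequential attention mechanism can be pictured as a chain of sparse feature masks: at each decision step a learnable transform scores the features, the scores are scaled by a prior that tracks how heavily each feature has already been used, and a sparsemax projection (Martins & Astudillo, 2016) turns them into a sparse selection mask. The sketch below is a minimal NumPy illustration of that idea only, assuming random stand-in weights; the names `sparsemax`, `sequential_feature_masks`, `step_weights`, and `gamma` are hypothetical, and the paper's attentive transformer is a trained FC + batch-norm block fed by the previous decision step rather than the raw input, so this is not the authors' implementation.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax projection (Martins & Astudillo, 2016): like softmax, but it
    can assign exactly zero weight to low-scoring entries, giving a sparse mask."""
    z_sorted = np.sort(z)[::-1]                # scores in descending order
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum        # entries kept in the support
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max    # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)

def sequential_feature_masks(features, step_weights, gamma=1.5):
    """Toy sketch of TabNet-style sequential attentive feature selection.

    features     : (d,) input feature vector
    step_weights : list of (d, d) arrays standing in for the learned attentive
                   transformers (random here, purely illustrative)
    gamma        : relaxation parameter; the prior discourages re-selecting
                   features that earlier decision steps already used
    """
    prior = np.ones_like(features)             # how much each feature may still be used
    masks = []
    for W in step_weights:
        logits = W @ features                  # stand-in attention scores
        mask = sparsemax(prior * logits)       # sparse per-step feature mask
        prior = prior * (gamma - mask)         # shrink prior for selected features
        masks.append(mask)
    return masks

# Example: three decision steps over eight features with random stand-in weights.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
steps = [rng.normal(size=(8, 8)) for _ in range(3)]
for i, m in enumerate(sequential_feature_masks(x, steps)):
    print(f"step {i}: selected feature indices -> {np.nonzero(m)[0]}")
```

The two ingredients this sketch illustrates are the sparsemax projection, which makes each step's mask hard-sparse and therefore directly usable as a feature attribution, and the prior-scale update, which encourages different decision steps to attend to different features.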
