ExcelFormer: A Neural Network Surpassing GBDTs on Tabular Data

Although deep neural networks have achieved remarkable success with supervised learning in fields such as computer vision, they still trail gradient-boosted decision trees (GBDTs) on tabular data. Examining this gap, we find that careful handling of feature interactions and feature representations is crucial for neural networks to be effective on tabular data. We develop a neural network called ExcelFormer, which alternates between two attention modules that govern feature interactions and feature representation updates, respectively. A bespoke training methodology is introduced alongside the architecture. Specifically, the attention modules are initialized with very small parameter values so that they are attenuated at the start of training; as training proceeds, the effects of feature interactions and representation updates grow toward suitable levels under the guidance of our proposed regularization schemes, Feat-Mix and Hidden-Mix. Experiments on 28 public tabular datasets show that ExcelFormer outperforms extensively tuned GBDTs, an unprecedented result for deep neural networks on supervised tabular learning.
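To make the two training ideas concrete, below is a minimal, illustrative sketch of (a) an attention block whose residual branch is scaled by a learnable scalar initialized near zero, so its contribution is attenuated early in training and grows as optimization proceeds, and (b) a feature-wise mixup in the spirit of Feat-Mix, where a random subset of feature columns is swapped between paired samples and labels are mixed in proportion to the kept fraction. The names `AttenuatedAttention`, `feat_mix`, `init_scale`, and the exact mixing rule are assumptions for illustration and may differ from the paper's actual formulation.

```python
import torch
import torch.nn as nn


class AttenuatedAttention(nn.Module):
    """Self-attention over feature tokens whose output starts near zero.

    The residual branch is scaled by a learnable scalar ('alpha',
    hypothetical name) initialized to a tiny value, so feature
    interactions are almost disabled at the start of training.
    """

    def __init__(self, dim, num_heads=4, init_scale=1e-4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.alpha = nn.Parameter(torch.full((1,), init_scale))

    def forward(self, x):  # x: (batch, num_features, dim)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + self.alpha * out  # attenuated residual update


def feat_mix(x, y, beta=0.5):
    """Feature-wise mixup sketch: swap a random subset of columns between
    each sample and a shuffled partner; mix (soft/one-hot) labels by the
    fraction of columns kept from the original sample."""
    batch, num_feat = x.shape
    perm = torch.randperm(batch)
    lam = torch.distributions.Beta(beta, beta).sample().item()
    num_keep = max(1, int(round(lam * num_feat)))
    keep = torch.randperm(num_feat)[:num_keep]
    mask = torch.zeros(num_feat, dtype=torch.bool)
    mask[keep] = True
    x_mixed = torch.where(mask, x, x[perm])     # keep masked columns, swap the rest
    lam_eff = num_keep / num_feat               # effective mixing coefficient
    y_mixed = lam_eff * y + (1.0 - lam_eff) * y[perm]
    return x_mixed, y_mixed
```

A Hidden-Mix variant would apply an analogous element-wise mixing to hidden representations rather than raw features; both act as regularizers that shape how strongly feature interactions influence training.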
