Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation

Automated machine learning (AutoML) can produce complex model ensembles by stacking, bagging, and boosting many individual models like trees, deep networks, and nearest neighbor estimators. While highly accurate, the resulting predictors are large, slow, and opaque as compared to their constituents. To improve the deployment of AutoML on tabular data, we propose FAST-DAD to distill arbitrarily complex ensemble predictors into individual models like boosted trees, random forests, and deep networks. At the heart of our approach is a data augmentation strategy based on Gibbs sampling from a self-attention pseudolikelihood estimator. Across 30 datasets spanning regression and binary/multiclass classification tasks, FAST-DAD distillation produces significantly better individual models than one obtains through standard training on the original data. Our individual distilled models are over 10x faster and more accurate than ensemble predictors produced by AutoML tools like H2O/AutoSklearn.
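
To make the augmentation step concrete, the following is a minimal, hedged sketch of the idea described above, assuming scikit-learn-style components: per-feature conditionals are approximated here with simple linear-Gaussian models (a stand-in for the paper's self-attention pseudolikelihood estimator), a random forest plays the role of the teacher ensemble, and a gradient-boosted classifier plays the role of the single distilled student. All function and variable names are illustrative, not the paper's implementation.

    # Hedged sketch of augmented distillation: Gibbs-style resampling of features
    # to generate synthetic rows, teacher labeling, then training one student model.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

    def fit_conditionals(X):
        """Approximate each conditional p(x_i | x_-i) with a linear model plus a residual noise scale."""
        conds = []
        for i in range(X.shape[1]):
            others = np.delete(X, i, axis=1)
            reg = LinearRegression().fit(others, X[:, i])
            sigma = float(np.std(X[:, i] - reg.predict(others)))
            conds.append((reg, sigma))
        return conds

    def gibbs_augment(X, conds, rounds=3, seed=0):
        """Create synthetic rows by resampling one feature at a time (a Gibbs sweep)."""
        rng = np.random.default_rng(seed)
        X_new = X.copy()
        for _ in range(rounds):
            for i, (reg, sigma) in enumerate(conds):
                others = np.delete(X_new, i, axis=1)
                X_new[:, i] = reg.predict(others) + rng.normal(0.0, sigma, size=len(X_new))
        return X_new

    # Toy data: the teacher ensemble labels both real and synthetic rows,
    # and a single student model is trained on the combined, teacher-labeled set.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 8))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    teacher = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
    X_aug = gibbs_augment(X, fit_conditionals(X))
    X_all = np.vstack([X, X_aug])
    y_soft = teacher.predict_proba(X_all)[:, 1]  # teacher's soft predictions

    # Hard-label distillation for brevity; regressing on y_soft is closer in spirit
    # to matching the teacher's full predictive distribution.
    student = GradientBoostingClassifier(random_state=0).fit(X_all, (y_soft >= 0.5).astype(int))

The design point illustrated here is that the synthetic rows need only be plausible under the feature distribution, since their labels come from the teacher rather than from the original targets.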
