Benchmarking Multimodal AutoML for Tabular Data with Text Fields

We consider the use of automated supervised learning systems for data tables that contain not only numeric and categorical columns, but also one or more text fields. Here we assemble 18 multimodal data tables, each containing text fields and stemming from a real business application. Our publicly available benchmark enables researchers to comprehensively evaluate their own methods for supervised learning with numeric, categorical, and text features. To ensure that any single modeling strategy which performs well across all 18 datasets will serve as a practical foundation for multimodal text/tabular AutoML, the datasets vary greatly in sample size, problem type (a mix of classification and regression tasks), and number of features (the number of text columns ranges from 1 to 28 across datasets), as well as in how the predictive signal is split between text and numeric/categorical features (and predictive interactions thereof). On this benchmark, we evaluate several straightforward pipelines for modeling such data, including standard two-stage approaches in which NLP is first used to featurize the text so that AutoML for tabular data can then be applied. The fully automated methodology that performed best on our benchmark (stack ensembling a multimodal Transformer with various tree models), when fit to the raw text/tabular data and compared against human data science teams, ranks 1st in two MachineHack prediction competitions and 2nd (out of 2380 teams) in Kaggle’s Mercari Price Suggestion Challenge.
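As a rough illustration of the two-stage baseline mentioned above, the sketch below featurizes a text column with TF-IDF, one-hot encodes categoricals, and hands the combined features to a tree ensemble. It is a minimal sketch, not the benchmarked pipelines themselves (the best-performing strategy stack-ensembles a multimodal Transformer with tree models); the file name, column names, and label name are hypothetical.

```python
# Minimal two-stage text+tabular baseline (assumed column/label names).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.read_csv("train.csv")                     # hypothetical multimodal table
X, y = train.drop(columns=["label"]), train["label"]

featurize = ColumnTransformer(
    transformers=[
        # Stage 1: turn the raw text field into numeric features.
        ("text", TfidfVectorizer(max_features=5000), "description"),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["category", "region"]),
    ],
    remainder="passthrough",                         # remaining numeric columns pass through
)

# Stage 2: fit an ordinary tabular learner on the concatenated features.
model = Pipeline([
    ("features", featurize),
    ("clf", RandomForestClassifier(n_estimators=300, n_jobs=-1)),
])
model.fit(X, y)
```

Swapping the TF-IDF featurizer for pretrained Transformer embeddings, or replacing the whole pipeline with an AutoML system that ingests the raw text/tabular columns directly, yields the other families of approaches compared in the benchmark.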
