Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data

Selecting suitable architecture parameters and training hyperparameters is essential for enhancing machine learning (ML) model performance. Several recent empirical studies have conducted large-scale correlational analyses of neural networks (NNs) in search of effective \emph{generalization metrics} that can guide this type of model selection. An effective metric is typically expected to correlate strongly with test performance. In this paper, we expand on prior analyses by examining generalization-metric-based model selection with the following objectives: (i) focusing on natural language processing (NLP) tasks, as prior work primarily concentrates on computer vision (CV) tasks; (ii) considering metrics that directly predict \emph{test error} instead of the \emph{generalization gap}; (iii) exploring metrics that can be computed without access to any training or testing data. Pursuing these objectives, we provide the first model-selection results on large pretrained Transformers from Huggingface using generalization metrics. Our analyses consider (I) hundreds of Transformers trained in different settings, in which we systematically vary the amount of training data, the model size, and the optimization hyperparameters, (II) a total of 51 pretrained Transformers from eight families of Huggingface NLP models, including GPT2 and BERT, and (III) a total of 28 existing and novel generalization metrics. Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks, exhibiting stronger correlations with test performance than other, more popular metrics. To further examine these metrics, we extend prior formulations that rely on power law (PL) spectral distributions to exponential (EXP) and exponentially truncated power law (E-TPL) families.
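
As a concrete illustration of the kind of data-free HT metric studied here, the sketch below fits a PL exponent to the eigenvalue spectrum of each weight matrix of a Huggingface model and averages the per-layer exponents. This is a minimal sketch under stated assumptions, not the paper's exact pipeline: the model name "gpt2", the minimum-dimension threshold, and the helper function names are illustrative choices, and the fits use the open-source `powerlaw` package, which also supports exponential and truncated power law families.

```python
# A minimal sketch (assumptions noted in the text, not the paper's exact pipeline)
# of a data-free heavy-tailed (HT) generalization metric: fit a power law (PL)
# to the empirical spectral density of each weight matrix and average the
# fitted exponents across layers.
import numpy as np
import powerlaw                      # heavy-tail fitting package (Alstott et al.)
from transformers import AutoModel   # Huggingface Transformers

def layer_alpha(weight: np.ndarray) -> float:
    """PL exponent alpha fitted to the eigenvalues of W^T W."""
    eigs = np.linalg.svd(weight, compute_uv=False) ** 2  # squared singular values
    fit = powerlaw.Fit(eigs)
    # The same fit object can compare distribution families, e.g. for the
    # EXP / E-TPL variants discussed in the abstract:
    # fit.distribution_compare('power_law', 'truncated_power_law')
    return fit.alpha

def model_alpha(model_name: str = "gpt2") -> float:
    """Average PL exponent over the large 2-D weight matrices of a model."""
    model = AutoModel.from_pretrained(model_name)
    alphas = []
    for _, param in model.named_parameters():
        w = param.detach().cpu().numpy()
        if w.ndim == 2 and min(w.shape) >= 50:   # skip biases and tiny matrices
            alphas.append(layer_alpha(w))
    return float(np.mean(alphas))

if __name__ == "__main__":
    # No training or testing data is touched at any point.
    print("mean PL alpha for gpt2:", model_alpha("gpt2"))
```

Model selection then amounts to ranking trained models or checkpoints by such a metric; in this style of analysis, a metric's usefulness is judged by the rank correlation (e.g., Spearman's) between its values and the observed test errors.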
