GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective

Pre-trained language models (PLMs) are known to improve the generalization of natural language understanding models by leveraging large amounts of data during pre-training. However, out-of-distribution (OOD) generalization remains a challenge in many NLP tasks, limiting the real-world deployment of these methods. This paper presents the first attempt to create a unified benchmark, GLUE-X, for evaluating OOD robustness in NLP models, highlighting the importance of OOD robustness and providing insights into how to measure and improve it. The benchmark includes 13 publicly available datasets for OOD testing, and evaluations are conducted on 8 classic NLP tasks over 21 widely used PLMs, including GPT-3 and GPT-3.5. Our findings confirm the need for improved OOD accuracy in NLP tasks, as significant performance degradation relative to in-distribution (ID) accuracy was observed in all settings.
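The central measurement behind this kind of benchmark is the gap between ID and OOD accuracy for the same fine-tuned model. The sketch below illustrates that protocol only; the specific checkpoint (a DistilBERT model fine-tuned on SST-2), the OOD set (IMDB), and the HuggingFace-based pipeline are illustrative assumptions, not the paper's actual evaluation setup.

```python
# Minimal sketch of an ID-vs-OOD accuracy comparison (illustrative, not the paper's pipeline).
from datasets import load_dataset
from transformers import pipeline

def accuracy(clf, texts, labels):
    """Fraction of examples the classifier labels correctly."""
    preds = clf(texts, truncation=True)
    # Map pipeline outputs ("POSITIVE"/"NEGATIVE") to 1/0 labels.
    pred_ids = [1 if p["label"].upper().startswith("POS") else 0 for p in preds]
    return sum(int(p == y) for p, y in zip(pred_ids, labels)) / len(labels)

# Assumed model: a public DistilBERT checkpoint fine-tuned on SST-2 (sentiment).
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")

# In-distribution: SST-2 validation split (matches the model's training distribution).
sst2 = load_dataset("glue", "sst2", split="validation[:500]")
id_acc = accuracy(clf, sst2["sentence"], sst2["label"])

# Out-of-distribution: IMDB reviews (same task, different text distribution).
imdb = load_dataset("imdb", split="test[:500]")
ood_acc = accuracy(clf, imdb["text"], imdb["label"])

print(f"ID accuracy:  {id_acc:.3f}")
print(f"OOD accuracy: {ood_acc:.3f}")
print(f"Degradation:  {id_acc - ood_acc:.3f}")
```

A larger OOD degradation indicates that the model relies on distribution-specific cues rather than task-general features; averaging this gap over tasks and OOD datasets gives a single robustness score per model.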
