GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective

Pre-trained language models (PLMs) are known to improve the generalization of natural language understanding models by leveraging large amounts of data during pre-training. However, out-of-distribution (OOD) generalization remains a challenge in many NLP tasks, limiting the real-world deployment of these methods. This paper presents the first attempt to create a unified benchmark, GLUE-X, for evaluating OOD robustness in NLP models, highlighting the importance of OOD robustness and providing insights into how to measure and improve it. The benchmark includes 13 publicly available datasets for OOD testing, and evaluations are conducted on 8 classic NLP tasks over 21 widely used PLMs, including GPT-3 and GPT-3.5. Our findings confirm the need for improved OOD accuracy in NLP tasks, as significant performance degradation relative to in-distribution (ID) accuracy was observed in all settings.
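The central measurement behind this kind of benchmark is the gap between ID and OOD accuracy for the same fine-tuned model. The sketch below illustrates that protocol only; the specific checkpoint (a DistilBERT model fine-tuned on SST-2), the OOD set (IMDB), and the HuggingFace-based pipeline are illustrative assumptions, not the paper's actual evaluation setup.

```python
# Minimal sketch of an ID-vs-OOD accuracy comparison (illustrative, not the paper's pipeline).
from datasets import load_dataset
from transformers import pipeline

def accuracy(clf, texts, labels):
    """Fraction of examples the classifier labels correctly."""
    preds = clf(texts, truncation=True)
    # Map pipeline outputs ("POSITIVE"/"NEGATIVE") to 1/0 labels.
    pred_ids = [1 if p["label"].upper().startswith("POS") else 0 for p in preds]
    return sum(int(p == y) for p, y in zip(pred_ids, labels)) / len(labels)

# Assumed model: a public DistilBERT checkpoint fine-tuned on SST-2 (sentiment).
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")

# In-distribution: SST-2 validation split (matches the model's training distribution).
sst2 = load_dataset("glue", "sst2", split="validation[:500]")
id_acc = accuracy(clf, sst2["sentence"], sst2["label"])

# Out-of-distribution: IMDB reviews (same task, different text distribution).
imdb = load_dataset("imdb", split="test[:500]")
ood_acc = accuracy(clf, imdb["text"], imdb["label"])

print(f"ID accuracy:  {id_acc:.3f}")
print(f"OOD accuracy: {ood_acc:.3f}")
print(f"Degradation:  {id_acc - ood_acc:.3f}")
```

A larger OOD degradation indicates that the model relies on distribution-specific cues rather than task-general features; averaging this gap over tasks and OOD datasets gives a single robustness score per model.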
