GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective
Jindong Wang | Yue Zhang | Linyi Yang | Yidong Wang | Xing Xie | Yafu Li | Shuibai Zhang | Libo Qin | Hanmeng Liu
[1] Jindong Wang, et al. On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective, 2023, ArXiv.
[2] Ari S. Morcos, et al. The Robustness Limits of SoTA Vision Models to Natural Variation, 2022, ArXiv.
[3] Yejin Choi, et al. NeuroCounterfactuals: Beyond Minimal-Edit Counterfactuals for Richer Data Augmentation, 2022, EMNLP.
[4] Kevin Leach, et al. Evaluating Out-of-Distribution Performance on Document Image Classifiers, 2022, NeurIPS.
[5] Yixuan Li, et al. OpenOOD: Benchmarking Generalized Out-of-Distribution Detection, 2022, NeurIPS.
[6] Arabella J. Sinclair, et al. A taxonomy and review of generalization research in NLP, 2022, Nature Machine Intelligence.
[7] M. Zhou, et al. Pre-Training a Graph Recurrent Network for Language Representation, 2022, ArXiv.
[8] Seong Joon Oh, et al. ID and OOD Performance Are Sometimes Inversely Correlated on Real-world Datasets, 2022, ArXiv.
[9] Linyi Yang, et al. FactMix: Using a Few Labeled In-domain Examples to Generalize to Cross-domain Named Entity Recognition, 2022, COLING.
[10] B. Schiele, et al. Assaying Out-Of-Distribution Generalization in Transfer Learning, 2022, NeurIPS.
[11] Shuiwang Ji, et al. GOOD: A Graph Out-of-Distribution Benchmark, 2022, NeurIPS.
[12] Gerard de Melo, et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, ArXiv.
[13] Brian Mac Namee, et al. A Rationale-Centric Framework for Human-in-the-loop Machine Learning, 2022, ACL.
[14] I. Rish, et al. WOODS: Benchmarks for Out-of-Distribution Generalization in Time Series Tasks, 2022, Trans. Mach. Learn. Res.
[15] Ting-Hao 'Kenneth' Huang, et al. Are Shortest Rationales the Best Explanations for Human Understanding?, 2022, ACL.
[16] Swaroop Mishra, et al. Investigating Selective Prediction Approaches Across Several Tasks in IID, OOD, and Adversarial Settings, 2022, FINDINGS.
[17] Percy Liang, et al. Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution, 2022, ICLR.
[18] Shuohang Wang, et al. AdaPrompt: Adaptive Model Training for Prompt-based NLP, 2022, EMNLP.
[19] Xuezhi Wang, et al. Measure and Improve Robustness in NLP Models: A Survey, 2021, NAACL.
[20] Zhe Gan, et al. Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models, 2021, NeurIPS Datasets and Benchmarks.
[21] Antonios Anastasopoulos, et al. Systematic Inequalities in Language Technology Performance across the World’s Languages, 2021, ACL.
[22] R. Salakhutdinov, et al. FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding, 2021, ACL.
[23] Udit Arora, et al. Types of Out-of-Distribution Texts and How to Detect Them, 2021, EMNLP.
[24] Michael S. Bernstein, et al. On the Opportunities and Risks of Foundation Models, 2021, ArXiv.
[25] Guoao Wei, et al. FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark, 2021, ArXiv.
[26] Y. Gal, et al. Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks, 2021, NeurIPS Datasets and Benchmarks.
[27] Ruihai Dong, et al. Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis, 2021, ACL.
[28] Hongxia Jin, et al. Enhancing the generalization for Intent Classification and Out-of-Domain Detection in SLU, 2021, ACL.
[29] Francesca Toni, et al. Explanation-Based Human Debugging of NLP Models: A Survey, 2021, Transactions of the Association for Computational Linguistics.
[30] Zhiyi Ma, et al. Dynabench: Rethinking Benchmarking in NLP, 2021, NAACL.
[31] Cuiling Lan, et al. Generalizing to Unseen Domains: A Survey on Domain Generalization, 2021, IEEE Transactions on Knowledge and Data Engineering.
[32] Jeffrey Heer, et al. Polyjuice: Generating Counterfactuals for Explaining, Evaluating, and Improving Models, 2021, ACL.
[33] Zhao Wang, et al. Robustness to Spurious Correlations in Text Classification via Automatically Generated Counterfactuals, 2020, AAAI.
[34] Pang Wei Koh, et al. WILDS: A Benchmark of in-the-Wild Distribution Shifts, 2020, ICML.
[35] Barry Smyth, et al. Generating Plausible Counterfactual Explanations for Deep Transformers in Financial Text Classification, 2020, COLING.
[36] Zachary Chase Lipton, et al. Explaining The Efficacy of Counterfactually-Augmented Data, 2020, ICLR.
[37] Hinrich Schütze, et al. It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners, 2020, NAACL.
[38] Nicola De Cao, et al. KILT: a Benchmark for Knowledge Intensive Language Tasks, 2020, NAACL.
[39] Lifu Tu, et al. An Empirical Study on Robustness to Spurious Correlations using Pre-trained Language Models, 2020, Transactions of the Association for Computational Linguistics.
[40] Percy Liang, et al. Robustness to Spurious Correlations via Human Annotations, 2020, ICML.
[41] Eric P. Xing, et al. Self-Challenging Improves Cross-Domain Generalization, 2020, ECCV.
[42] Michael I. Jordan, et al. On the Theory of Transfer Learning: The Importance of Task Diversity, 2020, NeurIPS.
[43] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[44] Pang Wei Koh, et al. An Investigation of Why Overparameterization Exacerbates Spurious Correlations, 2020, ICML.
[45] Sameer Singh, et al. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList, 2020, ACL.
[46] Sen Wu, et al. Understanding and Improving Information Transfer in Multi-Task Learning, 2020, ICLR.
[47] Monojit Choudhury, et al. GLUECoS: An Evaluation Benchmark for Code-Switched NLP, 2020, ACL.
[48] Dawn Song, et al. Pretrained Transformers Improve Out-of-Distribution Robustness, 2020, ACL.
[49] Dian Yu, et al. CLUE: A Chinese Language Understanding Evaluation Benchmark, 2020, COLING.
[50] Yoav Goldberg, et al. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness?, 2020, ACL.
[51] Noah A. Smith, et al. Evaluating Models’ Local Decision Boundaries via Contrast Sets, 2020, FINDINGS.
[52] Quoc V. Le, et al. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, 2020, ICLR.
[53] Xipeng Qiu, et al. Pre-trained models for natural language processing: A survey, 2020, Science China Technological Sciences.
[54] Aaron C. Courville, et al. Out-of-Distribution Generalization via Risk Extrapolation (REx), 2020, ICML.
[55] Byron C. Wallace, et al. ERASER: A Benchmark to Evaluate Rationalized NLP Models, 2019, ACL.
[56] X. Xue, et al. Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models, 2019, ICLR.
[57] Jianmo Ni, et al. Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects, 2019, EMNLP.
[58] J. Weston, et al. Adversarial NLI: A New Benchmark for Natural Language Understanding, 2019, ACL.
[59] Daniel C. Castro, et al. Domain Generalization via Model-Agnostic Learning of Semantic Features, 2019, NeurIPS.
[60] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[61] Thomas Wolf, et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019, ArXiv.
[62] Zachary Chase Lipton, et al. Learning the Difference that Makes a Difference with Counterfactually-Augmented Data, 2019, ICLR.
[63] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.
[64] Minlie Huang, et al. Out-of-Domain Detection for Natural Language Understanding in Dialog Systems, 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[65] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[66] Hung-Yu Kao, et al. Probing Neural Network Comprehension of Natural Language Arguments, 2019, ACL.
[67] Sameer Singh, et al. Are Red Roses Red? Evaluating Consistency of Question-Answering Models, 2019, ACL.
[68] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.
[69] Omer Levy, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.
[70] R. Thomas McCoy, et al. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference, 2019, ACL.
[71] Thomas Lukasiewicz, et al. e-SNLI: Natural Language Inference with Natural Language Explanations, 2018, NeurIPS.
[72] Swami Sankaranarayanan, et al. MetaReg: Towards Domain Generalization using Meta-Regularization, 2018, NeurIPS.
[73] Zachary C. Lipton, et al. How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks, 2018, EMNLP.
[74] Rada Mihalcea, et al. Multi-Label Transfer Learning for Multi-Relational Semantic Similarity, 2018, *SEMEVAL.
[75] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[76] Omer Levy, et al. Annotation Artifacts in Natural Language Inference Data, 2018, NAACL.
[77] Samuel R. Bowman, et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.
[78] Zhiguo Wang, et al. Bilateral Multi-Perspective Matching for Natural Language Sentences, 2017, IJCAI.
[79] Philip Bachman, et al. NewsQA: A Machine Comprehension Dataset, 2016, Rep4NLP@ACL.
[80] Regina Barzilay, et al. Rationalizing Neural Predictions, 2016, EMNLP.
[81] Xiang Zhang, et al. Character-level Convolutional Networks for Text Classification, 2015, NIPS.
[82] Christopher Potts, et al. A large annotated corpus for learning natural language inference, 2015, EMNLP.
[83] Christopher Potts, et al. Learning Word Vectors for Sentiment Analysis, 2011, ACL.
[84] Jacob Cohen, et al. The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability, 1973.
[85] M. Friedman. A Comparison of Alternative Tests of Significance for the Problem of m Rankings, 1940.
[86] Dinh Q. Phung, et al. Domain Generalisation of NMT: Fusing Adapters with Leave-One-Domain-Out Training, 2022, FINDINGS.
[87] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[88] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[89] Ido Dagan, et al. The Sixth PASCAL Recognizing Textual Entailment Challenge, 2009, TAC.
[90] Chris Brockett, et al. Automatically Constructing a Corpus of Sentential Paraphrases, 2005, IJCNLP.