GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective
Jindong Wang | Yue Zhang | Linyi Yang | Yidong Wang | Xing Xie | Yafu Li | Shuibai Zhang | Libo Qin | Hanmeng Liu
[1] Jindong Wang, et al. On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective, 2023, arXiv.
[2] Ari S. Morcos, et al. The Robustness Limits of SoTA Vision Models to Natural Variation, 2022, arXiv.
[3] Yejin Choi, et al. NeuroCounterfactuals: Beyond Minimal-Edit Counterfactuals for Richer Data Augmentation, 2022, EMNLP.
[4] Kevin Leach, et al. Evaluating Out-of-Distribution Performance on Document Image Classifiers, 2022, NeurIPS.
[5] Yixuan Li, et al. OpenOOD: Benchmarking Generalized Out-of-Distribution Detection, 2022, NeurIPS.
[6] Arabella J. Sinclair, et al. A taxonomy and review of generalization research in NLP, 2022, Nature Machine Intelligence.
[7] M. Zhou, et al. Pre-Training a Graph Recurrent Network for Language Representation, 2022, arXiv.
[8] Seong Joon Oh, et al. ID and OOD Performance Are Sometimes Inversely Correlated on Real-world Datasets, 2022, arXiv.
[9] Linyi Yang, et al. FactMix: Using a Few Labeled In-domain Examples to Generalize to Cross-domain Named Entity Recognition, 2022, COLING.
[10] B. Schiele, et al. Assaying Out-Of-Distribution Generalization in Transfer Learning, 2022, NeurIPS.
[11] Shuiwang Ji, et al. GOOD: A Graph Out-of-Distribution Benchmark, 2022, NeurIPS.
[12] Gerard de Melo, et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, arXiv.
[13] Brian Mac Namee, et al. A Rationale-Centric Framework for Human-in-the-loop Machine Learning, 2022, ACL.
[14] I. Rish, et al. WOODS: Benchmarks for Out-of-Distribution Generalization in Time Series Tasks, 2022, Trans. Mach. Learn. Res.
[15] Ting-Hao 'Kenneth' Huang, et al. Are Shortest Rationales the Best Explanations for Human Understanding?, 2022, ACL.
[16] Swaroop Mishra, et al. Investigating Selective Prediction Approaches Across Several Tasks in IID, OOD, and Adversarial Settings, 2022, Findings.
[17] Percy Liang, et al. Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution, 2022, ICLR.
[18] Shuohang Wang, et al. AdaPrompt: Adaptive Model Training for Prompt-based NLP, 2022, EMNLP.
[19] Xuezhi Wang, et al. Measure and Improve Robustness in NLP Models: A Survey, 2021, NAACL.
[20] Antonios Anastasopoulos, et al. Systematic Inequalities in Language Technology Performance across the World’s Languages, 2021, ACL.
[21] R. Salakhutdinov, et al. FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding, 2021, ACL.
[22] Dinh Q. Phung, et al. Domain Generalisation of NMT: Fusing Adapters with Leave-One-Domain-Out Training, 2022, Findings.
[23] Zhe Gan, et al. Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models, 2021, NeurIPS Datasets and Benchmarks.
[24] Udit Arora, et al. Types of Out-of-Distribution Texts and How to Detect Them, 2021, EMNLP.
[25] Michael S. Bernstein, et al. On the Opportunities and Risks of Foundation Models, 2021, arXiv.
[26] Guoao Wei, et al. FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark, 2021, arXiv.
[27] Y. Gal, et al. Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks, 2021, NeurIPS Datasets and Benchmarks.
[28] Ruihai Dong, et al. Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis, 2021, ACL.
[29] Hongxia Jin, et al. Enhancing the generalization for Intent Classification and Out-of-Domain Detection in SLU, 2021, ACL.
[30] Francesca Toni, et al. Explanation-Based Human Debugging of NLP Models: A Survey, 2021, Transactions of the Association for Computational Linguistics.
[31] Zhiyi Ma, et al. Dynabench: Rethinking Benchmarking in NLP, 2021, NAACL.
[32] Cuiling Lan, et al. Generalizing to Unseen Domains: A Survey on Domain Generalization, 2021, IEEE Transactions on Knowledge and Data Engineering.
[33] Jeffrey Heer, et al. Polyjuice: Generating Counterfactuals for Explaining, Evaluating, and Improving Models, 2021, ACL.
[34] Zhao Wang, et al. Robustness to Spurious Correlations in Text Classification via Automatically Generated Counterfactuals, 2020, AAAI.
[35] Pang Wei Koh, et al. WILDS: A Benchmark of in-the-Wild Distribution Shifts, 2020, ICML.
[36] Zachary Chase Lipton, et al. Explaining The Efficacy of Counterfactually-Augmented Data, 2020, ICLR.
[37] Hinrich Schütze, et al. It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners, 2020, NAACL.
[38] Nicola De Cao, et al. KILT: a Benchmark for Knowledge Intensive Language Tasks, 2020, NAACL.
[39] Aaron C. Courville, et al. Out-of-Distribution Generalization via Risk Extrapolation (REx), 2020, ICML.
[40] Barry Smyth, et al. Generating Plausible Counterfactual Explanations for Deep Transformers in Financial Text Classification, 2020, COLING.
[41] Lifu Tu, et al. An Empirical Study on Robustness to Spurious Correlations using Pre-trained Language Models, 2020, Transactions of the Association for Computational Linguistics.
[42] Percy Liang, et al. Robustness to Spurious Correlations via Human Annotations, 2020, ICML.
[43] Eric P. Xing, et al. Self-Challenging Improves Cross-Domain Generalization, 2020, ECCV.
[44] Michael I. Jordan, et al. On the Theory of Transfer Learning: The Importance of Task Diversity, 2020, NeurIPS.
[45] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[46] Pang Wei Koh, et al. An Investigation of Why Overparameterization Exacerbates Spurious Correlations, 2020, ICML.
[47] Sameer Singh, et al. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList, 2020, ACL.
[48] Sen Wu, et al. Understanding and Improving Information Transfer in Multi-Task Learning, 2020, ICLR.
[49] Monojit Choudhury, et al. GLUECoS: An Evaluation Benchmark for Code-Switched NLP, 2020, ACL.
[50] Dawn Song, et al. Pretrained Transformers Improve Out-of-Distribution Robustness, 2020, ACL.
[51] Dian Yu, et al. CLUE: A Chinese Language Understanding Evaluation Benchmark, 2020, COLING.
[52] Yoav Goldberg, et al. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness?, 2020, ACL.
[53] Noah A. Smith, et al. Evaluating Models’ Local Decision Boundaries via Contrast Sets, 2020, Findings.
[54] Quoc V. Le, et al. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, 2020, ICLR.
[55] Xipeng Qiu, et al. Pre-trained models for natural language processing: A survey, 2020, Science China Technological Sciences.
[56] Byron C. Wallace, et al. ERASER: A Benchmark to Evaluate Rationalized NLP Models, 2019, ACL.
[57] X. Xue, et al. Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models, 2019, ICLR.
[58] J. Weston, et al. Adversarial NLI: A New Benchmark for Natural Language Understanding, 2019, ACL.
[59] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[60] Zachary Chase Lipton, et al. Learning the Difference that Makes a Difference with Counterfactually-Augmented Data, 2019, ICLR.
[61] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.
[62] Minlie Huang, et al. Out-of-Domain Detection for Natural Language Understanding in Dialog Systems, 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[63] Jianmo Ni, et al. Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects, 2019, EMNLP.
[64] Daniel C. Castro, et al. Domain Generalization via Model-Agnostic Learning of Semantic Features, 2019, NeurIPS.
[65] Thomas Wolf, et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019, arXiv.
[66] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, arXiv.
[67] Hung-Yu Kao, et al. Probing Neural Network Comprehension of Natural Language Arguments, 2019, ACL.
[68] Sameer Singh, et al. Are Red Roses Red? Evaluating Consistency of Question-Answering Models, 2019, ACL.
[69] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.
[70] Omer Levy, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.
[71] R. Thomas McCoy, et al. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference, 2019, ACL.
[72] Rada Mihalcea, et al. Multi-Label Transfer Learning for Multi-Relational Semantic Similarity, 2018, *SEMEVAL.
[73] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[74] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[75] Thomas Lukasiewicz, et al. e-SNLI: Natural Language Inference with Natural Language Explanations, 2018, NeurIPS.
[76] Swami Sankaranarayanan, et al. MetaReg: Towards Domain Generalization using Meta-Regularization, 2018, NeurIPS.
[77] Zachary C. Lipton, et al. How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks, 2018, EMNLP.
[78] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[79] Omer Levy, et al. Annotation Artifacts in Natural Language Inference Data, 2018, NAACL.
[80] Samuel R. Bowman, et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.
[81] Zhiguo Wang, et al. Bilateral Multi-Perspective Matching for Natural Language Sentences, 2017, IJCAI.
[82] Philip Bachman, et al. NewsQA: A Machine Comprehension Dataset, 2016, Rep4NLP@ACL.
[83] Regina Barzilay, et al. Rationalizing Neural Predictions, 2016, EMNLP.
[84] Xiang Zhang, et al. Character-level Convolutional Networks for Text Classification, 2015, NIPS.
[85] Christopher Potts, et al. A large annotated corpus for learning natural language inference, 2015, EMNLP.
[86] Christopher Potts, et al. Learning Word Vectors for Sentiment Analysis, 2011, ACL.
[87] Ido Dagan, et al. The Sixth PASCAL Recognizing Textual Entailment Challenge, 2009, TAC.
[88] Chris Brockett, et al. Automatically Constructing a Corpus of Sentential Paraphrases, 2005, IJCNLP.
[89] Jacob Cohen, et al. The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability, 1973.
[90] M. Friedman. A Comparison of Alternative Tests of Significance for the Problem of $m$ Rankings, 1940.