暂无分享,去创建一个
[1] Emily Denton,et al. Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure , 2020, FAccT.
[2] Hinrich Schütze,et al. It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners , 2020, NAACL.
[3] Jianfeng Gao,et al. DeBERTa: Decoding-enhanced BERT with Disentangled Attention , 2020, ICLR.
[4] Timnit Gebru,et al. Datasheets for datasets , 2018, Commun. ACM.
[5] Akiko Aizawa,et al. Benchmarking Machine Reading Comprehension: A Psychological Perspective , 2021, EACL.
[6] Peter Henderson,et al. With Little Power Comes Great Responsibility , 2020, EMNLP.
[7] Samuel R. Bowman,et al. Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options , 2020, AACL.
[8] Lawrence S. Moss,et al. OCNLI: Original Chinese Natural Language Inference , 2020, FINDINGS.
[9] Daniel Khashabi,et al. UNQOVERing Stereotypical Biases via Underspecified Questions , 2020, FINDINGS.
[10] Samuel R. Bowman,et al. Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work? , 2020, ACL.
[11] Emily M. Bender,et al. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data , 2020, ACL.
[12] Solon Barocas,et al. Language (Technology) is Power: A Critical Survey of “Bias” in NLP , 2020, ACL.
[13] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[14] Sameer Singh,et al. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList , 2020, ACL.
[15] Jennifer Chu-Carroll,et al. To Test Machine Comprehension, Start by Defining Comprehension , 2020, ACL.
[16] Samuel R. Bowman,et al. Collecting Entailment Data for Pretraining: New Protocols and Negative Results , 2020, EMNLP.
[17] Jacob Andreas,et al. Experience Grounds Language , 2020, EMNLP.
[18] Yejin Choi,et al. Evaluating Machines by their Real-World Language Use , 2020, ArXiv.
[19] Noah A. Smith,et al. Evaluating Models’ Local Decision Boundaries via Contrast Sets , 2020, FINDINGS.
[20] Ronan Le Bras,et al. Adversarial Filters of Dataset Biases , 2020, ICML.
[21] Kentaro Inui,et al. Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets , 2019, AAAI.
[22] Jordan L. Boyd-Graber. What Question Answering can Learn from Trivia Nerds , 2019, ACL.
[23] J. Weston,et al. Adversarial NLI: A New Benchmark for Natural Language Understanding , 2019, ACL.
[24] Colin Raffel,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..
[25] Ronan Le Bras,et al. WinoGrande , 2019, AAAI.
[26] Dan Jurafsky,et al. Utility is in the Eye of the User: A Critique of NLP Leaderboards , 2020, EMNLP.
[27] Sameer Singh,et al. ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension , 2019, ArXiv.
[28] Lora Aroyo,et al. Metrology for AI: From Benchmarks to Instruments , 2019, ArXiv.
[29] Ellie Pavlick,et al. Inherent Disagreements in Human Textual Inferences , 2019, Transactions of the Association for Computational Linguistics.
[30] Jiwei Li,et al. Large-scale Pretraining for Neural Machine Translation with Tens of Billions of Sentence Pairs , 2019, ArXiv.
[31] Roy Schwartz,et al. Show Your Work: Improved Reporting of Experimental Results , 2019, EMNLP.
[32] Yejin Choi,et al. Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning , 2019, EMNLP.
[33] Ming-Wei Chang,et al. Natural Questions: A Benchmark for Question Answering Research , 2019, TACL.
[34] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.
[35] Kenneth Ward Church,et al. A survey of 25 years of evaluation , 2019, Natural Language Engineering.
[36] Hung-Yu Kao,et al. Probing Neural Network Comprehension of Natural Language Arguments , 2019, ACL.
[37] Udo Kruschwitz,et al. A Crowdsourced Corpus of Multiple Judgments and Disagreement on Anaphoric Interpretation , 2019, NAACL.
[38] Samuel R. Bowman,et al. Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark , 2019, ACL.
[39] Omer Levy,et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems , 2019, NeurIPS.
[40] Yoav Goldberg,et al. Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them , 2019, NAACL-HLT.
[41] R. Thomas McCoy,et al. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference , 2019, ACL.
[42] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[43] Emily M. Bender,et al. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science , 2018, TACL.
[44] Juho Hamari,et al. The Gamification of Work: Lessons From Crowdsourcing , 2018, Journal of Management Inquiry.
[45] Yejin Choi,et al. SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference , 2018, EMNLP.
[46] Jason Baldridge,et al. Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns , 2018, TACL.
[47] Percy Liang,et al. Know What You Don’t Know: Unanswerable Questions for SQuAD , 2018, ACL.
[48] Carolyn Penstein Rosé,et al. Stress Test Evaluation for Natural Language Inference , 2018, COLING.
[49] Saif Mohammad,et al. Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems , 2018, *SEMEVAL.
[50] Rachel Rudinger,et al. Gender Bias in Coreference Resolution , 2018, NAACL.
[51] Rachel Rudinger,et al. Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation , 2018, BlackboxNLP@EMNLP.
[52] Omer Levy,et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , 2018, BlackboxNLP@EMNLP.
[53] Masatoshi Tsuchiya,et al. Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment , 2018, LREC.
[54] Emily M. Bender,et al. Towards Linguistically Generalizable NLP Systems: A Workshop and Shared Task , 2017, Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems.
[55] Percy Liang,et al. Adversarial Examples for Evaluating Reading Comprehension Systems , 2017, EMNLP.
[56] Chandler May,et al. Social Bias in Elicited Natural Language Inferences , 2017, EthNLP@EACL.
[57] Adam Tauman Kalai,et al. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings , 2016, NIPS.
[58] Sandro Pezzelle,et al. The LAMBADA dataset: Word prediction requiring a broad discourse context , 2016, ACL.
[59] Jian Zhang,et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text , 2016, EMNLP.
[60] Christopher Potts,et al. A large annotated corpus for learning natural language inference , 2015, EMNLP.
[61] Bruno Guillaume,et al. Creating Zombilingo, a game with a purpose for dependency syntax annotation , 2014, GamifIR '14.
[62] Christopher Potts,et al. Learning Word Vectors for Sentiment Analysis , 2011, ACL.
[63] Hector J. Levesque,et al. The Winograd Schema Challenge , 2011, AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.
[64] Marjorie Florestal,et al. Is a Burrito a Sandwich? Exploring Race, Class and Culture in Contracts , 2008 .
[65] Laura A. Dabbish,et al. Labeling images with a computer game , 2004, AAAI Spring Symposium: Knowledge Collection from Volunteer Contributors.
[66] Danqi Chen,et al. of the Association for Computational Linguistics: , 2001 .
[67] Stephen Pulman,et al. Using the Framework , 1996 .