What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?

Crowdsourcing is widely used to create data for common natural language understanding tasks. Despite the importance of these datasets for measuring and refining model understanding of language, there has been little focus on the crowdsourcing methods used to collect them. In this paper, we compare the efficacy of interventions that have been proposed in prior work as ways of improving data quality. We use multiple-choice question answering as a testbed and run a randomized trial, assigning crowdworkers to write questions under one of four data collection protocols. We find that asking workers to write explanations for their examples is an ineffective stand-alone strategy for boosting NLU example difficulty. However, training crowdworkers and then using an iterative process of collecting data, sending feedback, and qualifying workers based on expert judgments is an effective means of collecting challenging data. Using crowdsourced judgments, rather than expert judgments, to qualify workers and send feedback does not prove effective. We observe that the data from the iterative protocol with expert assessments is more challenging by several measures. Notably, the human–model gap on the unanimous-agreement portion of this data is, on average, twice as large as the gap for the baseline protocol data.
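
The headline comparison above rests on the human–model accuracy gap computed over the examples on which all human validators agree. Below is a minimal sketch of how such a gap might be computed; the `Example` fields and the treatment of human accuracy are illustrative assumptions, not the authors' evaluation code.

```python
# Minimal sketch (assumed data layout, not the paper's released code) of a
# human-model gap restricted to the unanimous-agreement subset of a dataset.

from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    model_correct: bool          # did the model answer this question correctly?
    annotator_labels: List[str]  # answer options chosen by the human validators
    gold_label: str              # the question writer's intended answer


def accuracy(flags: List[bool]) -> float:
    return sum(flags) / len(flags) if flags else 0.0


def human_model_gap(examples: List[Example]) -> float:
    """Human accuracy minus model accuracy on examples where all validators agree."""
    unanimous = [ex for ex in examples if len(set(ex.annotator_labels)) == 1]
    human_acc = accuracy([ex.annotator_labels[0] == ex.gold_label for ex in unanimous])
    model_acc = accuracy([ex.model_correct for ex in unanimous])
    return human_acc - model_acc
```

Computing this quantity separately for the expert-in-the-loop protocol data and the baseline protocol data, and comparing the two values, corresponds to the "twice as large" comparison reported in the abstract.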
