In Search of Ambiguity: A Three-Stage Workflow Design to Clarify Annotation Guidelines for Crowd Workers

We propose a novel three-stage FIND-RESOLVE-LABEL workflow for crowdsourced annotation to reduce ambiguity in task instructions and thus improve annotation quality. Stage 1 (FIND) asks the crowd to find examples whose correct label seems ambiguous given the task instructions; workers also provide a short tag describing the ambiguous concept embodied by each instance found. We compare collaborative vs. non-collaborative designs for this stage. In Stage 2 (RESOLVE), the requester selects one or more of these ambiguous examples to label, resolving the ambiguity; the new label(s) are automatically injected back into the task instructions to improve their clarity. Finally, in Stage 3 (LABEL), workers perform the actual annotation using the revised guidelines with clarifying examples. We compare three designs for presenting these examples: examples only, tags only, or both. We report image-labeling experiments over six task designs on Amazon Mechanical Turk. Results show improved annotation accuracy and yield further insights into effective task design for crowdsourced annotation.
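To make the pipeline concrete, the sketch below walks one small batch of items through the three stages. It is a minimal illustration under stated assumptions, not the authors' implementation or any Mechanical Turk API: the class and function names (AmbiguousExample, Guidelines, find_stage, resolve_stage, label_stage) and the stand-in callbacks are hypothetical, and in a real deployment the flag, requester_label, and worker_label callbacks would be replaced by actual crowd and requester input.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class AmbiguousExample:
    # An item a Stage-1 (FIND) worker flagged as ambiguous, plus the short tag
    # naming the ambiguous concept it embodies. Hypothetical structure for
    # exposition only.
    item_id: str
    tag: str


@dataclass
class Guidelines:
    # Task instructions plus the clarifying examples injected during RESOLVE.
    instructions: str
    clarifications: List[Tuple[AmbiguousExample, str]] = field(default_factory=list)

    def render(self) -> str:
        # Revised guidelines shown to Stage-3 workers: original instructions
        # followed by the resolved examples (here, tag and example together).
        lines = [self.instructions]
        for example, label in self.clarifications:
            lines.append(
                f'- Items like "{example.tag}" (e.g. {example.item_id}) should be labeled: {label}'
            )
        return "\n".join(lines)


def find_stage(items: List[str], flag: Callable[[str], Optional[str]]) -> List[AmbiguousExample]:
    # Stage 1 (FIND): workers scan items and flag those whose correct label
    # seems ambiguous under the current instructions, attaching a concept tag.
    found = []
    for item in items:
        tag = flag(item)
        if tag is not None:
            found.append(AmbiguousExample(item_id=item, tag=tag))
    return found


def resolve_stage(
    guidelines: Guidelines,
    ambiguous: List[AmbiguousExample],
    requester_label: Callable[[AmbiguousExample], str],
    budget: int = 3,
) -> Guidelines:
    # Stage 2 (RESOLVE): the requester labels up to `budget` flagged examples;
    # the new labels are injected back into the instructions as clarifications.
    for example in ambiguous[:budget]:
        guidelines.clarifications.append((example, requester_label(example)))
    return guidelines


def label_stage(
    items: List[str], guidelines: Guidelines, worker_label: Callable[[str, str], str]
) -> Dict[str, str]:
    # Stage 3 (LABEL): workers annotate every item under the revised guidelines.
    revised = guidelines.render()
    return {item: worker_label(item, revised) for item in items}


if __name__ == "__main__":
    # Toy "does this image contain a dog?" task with simulated participants.
    items = ["img_01_husky", "img_02_wolf", "img_03_cartoon_dog", "img_04_cat"]
    guidelines = Guidelines("Label each image: does it contain a dog? (yes/no)")

    def flag(item: str) -> Optional[str]:
        # Stand-in for FIND workers: flag borderline images with a concept tag.
        if "wolf" in item:
            return "wolf (dog-like wild animal)"
        if "cartoon" in item:
            return "cartoon/illustrated dog"
        return None

    def requester_label(example: AmbiguousExample) -> str:
        # Stand-in for the requester in RESOLVE.
        return "no" if "wolf" in example.tag else "yes"

    def worker_label(item: str, revised_guidelines: str) -> str:
        # Stand-in for LABEL workers who follow the clarified guidelines.
        return "no" if ("wolf" in item or "cat" in item) else "yes"

    ambiguous = find_stage(items, flag)
    revised = resolve_stage(guidelines, ambiguous, requester_label)
    print(revised.render())
    print(label_stage(items, revised, worker_label))
```

Rendering the resolved examples directly into the guidelines mirrors the RESOLVE-to-LABEL handoff described above; whether Stage-3 workers see the examples, the tags, or both corresponds to the three presentation designs compared in the paper.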
