Parting Crowds: Characterizing Divergent Interpretations in Crowdsourced Annotation Tasks

Crowdsourcing is a common strategy for collecting the "gold standard" labels required for many natural language applications. Crowdworkers differ in their responses for many reasons, but existing approaches often treat disagreements as "noise" to be removed through filtering or aggregation. In this paper, we introduce the workflow design pattern of crowd parting: separating workers based on shared patterns in their responses to a crowdsourcing task. We illustrate this idea using an automated clustering-based method to identify divergent, but valid, worker interpretations in crowdsourced entity annotations collected over two distinct corpora -- Wikipedia articles and Tweets. We demonstrate how the intermediate-level view provided by crowd-parting analysis offers insight into sources of disagreement not easily gleaned from viewing either individual annotation sets or aggregated results. We discuss several concrete ways this approach could be applied directly to improve the quality and efficiency of crowdsourced annotation tasks.
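
The abstract describes crowd parting as clustering workers by shared response patterns. A minimal sketch of one plausible realization follows, assuming each worker's annotations are represented as a set of labeled entity spans and that workers are grouped by pairwise Jaccard disagreement using hierarchical clustering; the distance metric, linkage choice, and function names here are illustrative assumptions, not the authors' exact method.

```python
# Hypothetical sketch: partition workers into interpretation groups by
# clustering their pairwise annotation (dis)agreement. The Jaccard metric
# and average linkage are illustrative choices, not the paper's exact setup.
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def crowd_parting(worker_annotations, n_groups=2):
    """Group workers whose entity annotations follow similar patterns.

    worker_annotations: dict mapping worker_id -> set of
        (doc_id, start, end, label) tuples produced by that worker.
    Returns: dict mapping worker_id -> cluster id (1..n_groups).
    """
    workers = sorted(worker_annotations)
    n = len(workers)
    dist = np.zeros((n, n))
    # Pairwise disagreement = 1 - Jaccard overlap of the two annotation sets.
    for (i, a), (j, b) in combinations(enumerate(workers), 2):
        sa, sb = worker_annotations[a], worker_annotations[b]
        union = len(sa | sb)
        jaccard = len(sa & sb) / union if union else 1.0
        dist[i, j] = dist[j, i] = 1.0 - jaccard
    # Agglomerative clustering over the condensed distance matrix.
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_groups, criterion="maxclust")
    return dict(zip(workers, labels))
```

Inspecting the spans that each resulting worker group annotates differently (e.g., whether "White House" is tagged as a location or an organization) is one way to surface the divergent but valid interpretations the abstract refers to.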
