Crowdsourcing a Subjective Labeling Task: A Human-Centered Framework to Ensure Reliable Results

How can we best use crowdsourcing to perform a subjective labeling task that suffers from low inter-rater agreement? We have developed a framework for debugging this type of subjective judgment task and for improving label quality before the crowdsourcing task is run at scale. Our framework alternately varies characteristics of the work, assesses the reliability of the workers, and refines the task design by disaggregating the labels into components that may be less subjective for workers, thereby potentially improving inter-rater agreement. A second contribution of this work is a technique, Human Intelligence Data-Driven Enquiries (HIDDEN), that uses Captcha-inspired subtasks to evaluate worker effectiveness and reliability while also producing useful results and enhancing task performance. HIDDEN subtasks pivot around the same data as the main task, but ask workers to make less subjective judgments that yield higher inter-rater agreement. To illustrate our framework and techniques, we discuss our efforts to label high-quality social media content, with the ultimate aim of identifying meaningful signal within complex results.
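
The abstract does not say which agreement statistic the framework uses; a standard choice for comparing label quality across task designs with multiple raters is Fleiss' kappa, computed over per-item category counts. The sketch below is a minimal illustration rather than the paper's implementation: the function, the two example matrices (a subjective "is this tweet interesting?" main task and a hypothetical HIDDEN subtask), and the five-workers-per-item setup are all assumptions for demonstration.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an N x k matrix of label counts.

    counts[i, j] = number of workers who assigned item i to category j;
    every item must be judged by the same number of workers.
    """
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Proportion of all judgments falling into each category.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()         # observed agreement
    p_e = (p_j ** 2).sum()     # agreement expected by chance
    return float((p_bar - p_e) / (1 - p_e))

# Hypothetical data: 4 items, 5 workers each, categories {interesting, not interesting}.
subjective_main_task = np.array([[3, 2], [2, 3], [4, 1], [1, 4]])  # split judgments
hidden_subtask = np.array([[5, 0], [4, 1], [0, 5], [5, 0]])        # near-unanimous
print(fleiss_kappa(subjective_main_task))  # ~0.0  (chance-level agreement)
print(fleiss_kappa(hidden_subtask))        # ~0.76 (substantial agreement)
```

In this toy example the subjective main task scores at chance level while the less subjective subtask scores substantially higher; surfacing that kind of gap before running the task at scale is the point of comparing agreement across alternative task designs.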
