Evaluating rater quality and rating difficulty in online annotation activities

Gathering annotations from non-expert online raters is an attractive method for quickly completing large-scale annotation tasks, but the increased possibility of unreliable annotators and diminished work quality remains a cause for concern. In information retrieval, where human-encoded relevance judgments underlie the evaluation of new systems and methods, the ability to collect trustworthy annotations quickly and reliably allows for faster development and iteration of research. Focusing on paid online workers, this study evaluates indicators of non-expert performance along three lines: temporality, experience, and agreement. It is found that a rater's past performance is a key indicator of their future performance. Additionally, the time raters spend familiarizing themselves with a new set of tasks matters for rating quality, as does long-term familiarity with the topic being rated. These findings may inform how large-scale digital collections use non-expert raters to perform more purposive and affordable online annotation activities.
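
To make the three indicator families concrete, the following Python sketch illustrates two of them under stated assumptions; it is not the study's actual procedure. It scores each rater's agreement against the majority label for an item and then compares the rater's early judgments (their familiarization period) with their later ones. All rater IDs, items, and labels below are hypothetical.

```python
# A minimal sketch, assuming judgments arrive as (rater_id, item_id, label)
# tuples in submission order; all values here are illustrative placeholders.
from collections import Counter, defaultdict

judgments = [
    ("r1", "d1", 1), ("r2", "d1", 1), ("r3", "d1", 0),
    ("r1", "d2", 0), ("r2", "d2", 0), ("r3", "d2", 0),
    ("r1", "d3", 1), ("r2", "d3", 0), ("r3", "d3", 1),
    ("r1", "d4", 1), ("r2", "d4", 1), ("r3", "d4", 1),
]

# Majority label per item serves as a stand-in for a trusted judgment.
labels_by_item = defaultdict(list)
for _, item, label in judgments:
    labels_by_item[item].append(label)
majority = {item: Counter(ls).most_common(1)[0][0]
            for item, ls in labels_by_item.items()}

# Agreement indicator: for each rater, record whether each of their
# judgments matched the majority, in the order the judgments were made.
hits = defaultdict(list)
for rater, item, label in judgments:
    hits[rater].append(int(label == majority[item]))

def accuracy(xs):
    return sum(xs) / len(xs) if xs else float("nan")

for rater, h in sorted(hits.items()):
    # Temporality/experience indicator: compare the first half of a rater's
    # history (familiarization) with the second half (experienced work).
    mid = len(h) // 2
    print(f"{rater}: overall={accuracy(h):.2f} "
          f"early={accuracy(h[:mid]):.2f} late={accuracy(h[mid:]):.2f}")
```

In this toy setup, a gap between the early and late scores would suggest a familiarization effect, while a rater's overall agreement score is the kind of past-performance signal the abstract identifies as predictive of future quality.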
