Content and Behaviour Based Metrics for Crowd Truth

When crowdsourcing gold standards for NLP tasks, workers may not reach consensus on a single correct solution for each task. The goal of Crowd Truth is to embrace such disagreement between individual annotators and harness it as useful information that signals vague or ambiguous examples. Although the technique relies on disagreement, we also assume that differing opinions will cluster around the more plausible alternatives. It is therefore possible to identify workers who systematically disagree both with the majority opinion and with the rest of their co-workers as low-quality or spam workers. In this paper we present a more detailed formalization of metrics for Crowd Truth in the context of medical relation extraction, together with a set of additional filtering techniques that require workers to briefly justify their answers. These explanation-based techniques prove particularly useful in conjunction with the disagreement-based metrics, achieving 95% accuracy in identifying low-quality and spam submissions in crowdsourcing settings where the spam rate is high.
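To make the disagreement-based side of this idea concrete, the sketch below is a minimal, hypothetical illustration (not the paper's actual implementation): each worker's answers on a task unit are encoded as a binary vector over the candidate answers, and a worker's agreement score is the average cosine similarity between their vector and the aggregate vector of all other workers on the same unit. All names (`worker_agreement`, the input layout) are assumptions made for this example.

```python
import numpy as np

def worker_agreement(annotations):
    """annotations: dict mapping unit_id -> dict of worker_id -> binary
    numpy vector over that unit's candidate answers.
    Returns dict worker_id -> average cosine similarity between the
    worker's vector and the summed vectors of all *other* workers."""
    totals, counts = {}, {}
    for unit, workers in annotations.items():
        agg = sum(workers.values())           # unit vector: sum of all worker vectors
        for wid, vec in workers.items():
            rest = agg - vec                  # exclude this worker's own votes
            denom = np.linalg.norm(vec) * np.linalg.norm(rest)
            if denom == 0:
                continue                      # no co-worker signal on this unit
            cos = float(vec @ rest) / denom   # cosine agreement with co-workers
            totals[wid] = totals.get(wid, 0.0) + cos
            counts[wid] = counts.get(wid, 0) + 1
    return {wid: totals[wid] / counts[wid] for wid in totals}
```

Under this sketch, workers whose average score sits far below that of their peers across many units disagree systematically with both the majority and their co-workers, and would be flagged as low-quality or spam candidates; the explanation-based filters described above would then be applied to the remaining borderline cases.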
