SQUARE: A Benchmark for Research on Computing Crowd Consensus

While many statistical consensus methods now exist, the relative lack of comparative benchmarking and integration of techniques has made it increasingly difficult to determine the current state-of-the-art, to evaluate the relative benefit of new methods, to understand where specific problems merit greater attention, and to measure field progress over time. To make such comparative evaluation easier, we present SQUARE, an open-source shared-task framework that includes benchmark datasets, defined tasks, standard metrics, and reference implementations with empirical results for several popular methods. In addition to measuring performance on a variety of public, real crowd datasets, the benchmark also varies supervision and noise by manipulating training-set size and labeling error. We envision SQUARE as dynamic and continually evolving, with new datasets and reference implementations added according to community needs and interest. We invite community contributions and participation.
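
As a concrete illustration of the kind of controlled-noise experiment described above, the sketch below simulates crowd labels at varying per-worker error rates and scores a simple majority-vote consensus baseline against gold labels. This is a minimal hypothetical Python sketch for illustration only, not the SQUARE framework's own API; the helper names (simulate_worker_labels, majority_vote, consensus_accuracy) are assumptions introduced here.

import random
from collections import Counter

def simulate_worker_labels(gold, n_workers=5, error_rate=0.2, n_classes=2, seed=0):
    # Each simulated worker reports the gold label, flipping it to a random
    # other class with probability error_rate.
    rng = random.Random(seed)
    labels = []
    for true_label in gold:
        votes = []
        for _ in range(n_workers):
            if rng.random() < error_rate:
                votes.append(rng.choice([c for c in range(n_classes) if c != true_label]))
            else:
                votes.append(true_label)
        labels.append(votes)
    return labels

def majority_vote(votes):
    # Most frequent label among one example's worker votes (ties broken arbitrarily).
    return Counter(votes).most_common(1)[0][0]

def consensus_accuracy(gold, labels):
    predictions = [majority_vote(v) for v in labels]
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

if __name__ == "__main__":
    rng = random.Random(1)
    gold = [rng.randint(0, 1) for _ in range(1000)]  # synthetic binary gold labels
    for error_rate in (0.1, 0.2, 0.3, 0.4):
        labels = simulate_worker_labels(gold, error_rate=error_rate)
        acc = consensus_accuracy(gold, labels)
        print(f"per-worker error {error_rate:.1f} -> majority-vote accuracy {acc:.3f}")

Sweeping error_rate (and, analogously, the amount of supervision given to a method) is the kind of sensitivity analysis the benchmark reports for each consensus method, with majority voting standing in here for the more sophisticated statistical estimators the framework implements.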
