An Evaluation of Aggregation Techniques in Crowdsourcing

As the volume of AI problems that rely on human knowledge continues to soar, crowdsourcing has become essential to a wide range of World Wide Web applications. One of the biggest challenges in crowdsourcing is aggregating the answers collected from the crowd, since workers may have widely varying levels of expertise. Many aggregation techniques have been proposed to tackle this challenge. These techniques, however, have never been compared and analyzed under the same setting, which makes choosing the ‘right’ one for a particular application very difficult. To address this problem, this paper presents a benchmark that offers a comprehensive empirical comparison of aggregation techniques. Specifically, we integrated several state-of-the-art methods in a comparable manner and used our benchmark to measure various performance metrics, including computation time, accuracy, robustness to spammers, and adaptivity to multi-labeling. We then provide an in-depth analysis of the benchmarking results, which were obtained by simulating the crowdsourcing process with different types of workers. We believe the findings from this benchmark can serve as a practical guideline for crowdsourcing applications.

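To make the evaluated setting concrete, the sketch below simulates a pool of workers with different reliability levels (including a random spammer) and aggregates their answers with simple majority voting, the usual baseline against which more sophisticated aggregation techniques are compared. This is a minimal illustration under assumed parameters, not the benchmark's actual implementation; the worker accuracies, task counts, and function names are hypothetical.

import random
from collections import Counter

def simulate_workers(num_tasks, worker_accuracies, true_labels, labels=(0, 1)):
    # Each worker answers every task; a worker with accuracy p gives the
    # true label with probability p and a wrong label otherwise.
    # (Illustrative worker model, assumed for this sketch.)
    answers = []  # answers[w][t] = label given by worker w on task t
    for p in worker_accuracies:
        row = []
        for t in range(num_tasks):
            if random.random() < p:
                row.append(true_labels[t])
            else:
                row.append(random.choice([l for l in labels if l != true_labels[t]]))
        answers.append(row)
    return answers

def majority_vote(answers, num_tasks):
    # Aggregate by taking the most frequent label per task (ties broken arbitrarily).
    return [Counter(answers[w][t] for w in range(len(answers))).most_common(1)[0][0]
            for t in range(num_tasks)]

def accuracy(estimated, true_labels):
    # Fraction of tasks whose aggregated label matches the ground truth.
    return sum(e == t for e, t in zip(estimated, true_labels)) / len(true_labels)

if __name__ == "__main__":
    random.seed(0)
    num_tasks = 1000
    true_labels = [random.choice((0, 1)) for _ in range(num_tasks)]
    # Hypothetical worker mix: two reliable workers, two average workers, one spammer.
    worker_accuracies = [0.9, 0.85, 0.7, 0.7, 0.5]
    answers = simulate_workers(num_tasks, worker_accuracies, true_labels)
    estimated = majority_vote(answers, num_tasks)
    print("majority-vote accuracy:", accuracy(estimated, true_labels))

The same simulated answer matrix can be fed to any other aggregation technique (for example, expectation-maximization-style methods that also estimate worker reliability), which is how accuracy and robustness to spammers can be compared under a single controlled setting.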