On Optimality of Jury Selection in Crowdsourcing

Recent advances in crowdsourcing technologies enable computationally challenging tasks (e.g., sentiment analysis and entity resolution) to be performed by Internet workers, driven mainly by monetary incentives. A fundamental question is: how should workers be selected, so that the tasks at hand can be accomplished successfully and economically? In this paper, we study the Jury Selection Problem (JSP): Given a monetary budget and a set of decision-making tasks (e.g., “Is Bill Gates still the CEO of Microsoft now?”), return the set of workers (called a jury) whose answers yield the highest “Jury Quality” (or JQ). Existing JSP solutions make use of the Majority Voting (MV) strategy, which uses the answer chosen by the largest number of workers. We show that MV does not yield the best solution for JSP. We further prove that among all voting strategies (including deterministic and randomized strategies), Bayesian Voting (BV) can optimally solve JSP. We then examine how to solve JSP based on BV. This is technically challenging, since computing the JQ with BV is NP-hard. We solve this problem by proposing an approximation algorithm that is computationally efficient. Our approximate JQ computation algorithm is also highly accurate, and its error is proven to be bounded within 1%. We extend our solution by considering the task owner’s “belief” (or prior) on the answers of the tasks. Experiments on synthetic and real datasets show that our new approach is consistently better than the best JSP solution known.
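To make the gap between MV and BV concrete, the following is a minimal sketch and not the paper's algorithm. It assumes binary tasks, workers who answer independently, a known accuracy per worker, and a JQ defined as the probability that the voting strategy returns the true answer. The function names (`jury_quality`, `majority_voting`, `bayesian_voting`) and the parameter `prior` (taken here to be the probability that the true answer is 0) are hypothetical. The JQ is computed by brute-force enumeration of all votings, which is exponential in the jury size and is only meant for tiny juries; it does not reproduce the approximate JQ computation described above.

```python
from itertools import product


def majority_voting(votes, qualities, prior):
    """Return 1 if strictly more than half of the workers voted 1, else 0."""
    return 1 if 2 * sum(votes) > len(votes) else 0


def bayesian_voting(votes, qualities, prior):
    """Return the answer with the larger posterior weight given the votes.

    Assumption: `prior` is P(true answer = 0) and workers err independently.
    Ties are broken in favor of 0.
    """
    p0, p1 = prior, 1.0 - prior
    for v, q in zip(votes, qualities):
        p0 *= q if v == 0 else 1.0 - q  # likelihood of vote v if truth is 0
        p1 *= q if v == 1 else 1.0 - q  # likelihood of vote v if truth is 1
    return 1 if p1 > p0 else 0


def jury_quality(qualities, strategy, prior=0.5):
    """Probability that `strategy` returns the true answer.

    Enumerates both possible truths and all 2^n votings (illustration only).
    """
    n = len(qualities)
    jq = 0.0
    for truth, p_truth in ((0, prior), (1, 1.0 - prior)):
        for votes in product((0, 1), repeat=n):
            # Joint probability of this truth and this voting.
            p = p_truth
            for v, q in zip(votes, qualities):
                p *= q if v == truth else 1.0 - q
            if strategy(votes, qualities, prior) == truth:
                jq += p
    return jq


if __name__ == "__main__":
    jury = [0.9, 0.6, 0.6]  # hypothetical worker accuracies
    print("MV:", round(jury_quality(jury, majority_voting), 6))
    print("BV:", round(jury_quality(jury, bayesian_voting), 6))
```

For this hypothetical jury, MV gives a JQ of about 0.792, while BV gives 0.9: under BV the two 0.6-accuracy workers can never outweigh the 0.9-accuracy worker, so the result follows the strongest worker, whereas MV lets the two weaker workers overrule that worker.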
