Budget-optimal crowdsourcing using low-rank matrix approximations

Crowdsourcing systems, in which numerous tasks are electronically distributed to numerous “information pieceworkers”, have emerged as an effective paradigm for human-powered solving of large-scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way, such as majority voting. In this paper, we consider a model of such crowdsourcing tasks and pose the problem of minimizing the total price (i.e., the number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers' answers. We show that our algorithm, based on low-rank matrix approximation, significantly outperforms majority voting and, in fact, is order-optimal through comparison to an oracle that knows the reliability of every worker.
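To make the contrast between majority voting and a low-rank approach concrete, the following is a minimal sketch and not the paper's algorithm. It assumes binary labels in {+1, -1}, a hypothetical worker population in which each worker is either a random guesser or mostly correct, and uniformly random task assignment. The function names (`simulate_answers`, `majority_vote`, `rank_one_estimate`) and all parameter values are illustrative choices. The low-rank estimate here simply takes the sign of the top left singular vector of the task-by-worker answer matrix, computed with a dense SVD.

```python
import numpy as np


def simulate_answers(n_tasks, n_workers, per_task, rng):
    """Simulate a task-by-worker answer matrix (assumed model, for illustration).

    Entry (i, j) is +1 or -1 if worker j answered task i, and 0 otherwise.
    """
    truth = rng.choice([-1, 1], size=n_tasks)
    # Assumed worker reliabilities: 60% random guessers, 40% mostly correct.
    p_correct = rng.choice([0.5, 0.95], size=n_workers, p=[0.6, 0.4])
    answers = np.zeros((n_tasks, n_workers))
    for i in range(n_tasks):
        workers = rng.choice(n_workers, size=per_task, replace=False)
        is_correct = rng.random(per_task) < p_correct[workers]
        answers[i, workers] = np.where(is_correct, truth[i], -truth[i])
    return answers, truth


def majority_vote(answers):
    """Estimate each label as the sign of the sum of its answers (ties go to +1)."""
    return np.where(answers.sum(axis=1) >= 0, 1, -1)


def rank_one_estimate(answers):
    """Estimate labels from the top singular pair of the answer matrix."""
    u, _, vt = np.linalg.svd(answers, full_matrices=False)
    task_vec, worker_vec = u[:, 0], vt[0]
    # Singular vectors are defined only up to a joint sign flip. Assume workers
    # are, on average, better than random, so worker weights should sum to > 0.
    if worker_vec.sum() < 0:
        task_vec = -task_vec
    return np.where(task_vec >= 0, 1, -1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    answers, truth = simulate_answers(400, 400, 15, rng)
    print("majority voting accuracy:", np.mean(majority_vote(answers) == truth))
    print("rank-one accuracy:", np.mean(rank_one_estimate(answers) == truth))
```

The intuition behind the sketch is that, under the assumed model, the expected answer matrix is close to rank one (true labels times worker reliabilities), so its leading singular vector weights reliable workers more heavily than a plain vote does. The total number of assignments, `n_tasks * per_task`, plays the role of the budget discussed above.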
