Improving Quality of Crowdsourced Labels via Probabilistic Matrix Factorization

Quality assurance in crowdsourced annotation often involves having a given example labeled multiple times by different workers and then aggregating those labels. Unfortunately, the resulting worker-example label matrix is typically sparse and imbalanced for two reasons: 1) the average crowd worker judges only a few examples; and 2) few labels are typically collected per example in order to reduce cost. To address this missing-data problem, we propose the use of probabilistic matrix factorization (PMF), a standard approach in collaborative filtering. To evaluate our approach, we measure the accuracy of consensus labels computed from the input sparse matrix versus from the PMF-inferred complete matrix. We consider both unsupervised and supervised settings; in the supervised case, we evaluate both weighted voting and worker selection. Experiments are performed on a synthetic data set and on a real data set: crowd relevance judgments from the 2010 NIST TREC Relevance Feedback Track.
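
To make the pipeline concrete, the sketch below illustrates one plausible way to apply PMF to a sparse worker-by-example label matrix and then read off consensus labels from the completed matrix. This is not the paper's implementation: the latent dimensionality, learning rate, regularization weight, simple gradient-descent training loop, and toy data generator are all illustrative assumptions standing in for real crowd judgments.

# Minimal PMF sketch (illustrative only, not the authors' implementation).
# Hyperparameters and the toy data below are assumptions for demonstration.
import numpy as np

rng = np.random.default_rng(0)

def pmf_complete(R, mask, k=5, lam=0.1, lr=0.01, epochs=200):
    """Factor the observed entries of R (workers x examples) into U @ V.T.

    R    : observed label matrix (e.g., 0/1 relevance votes), zeros where unobserved
    mask : boolean matrix, True where a worker actually labeled an example
    k    : latent dimensionality
    lam  : L2 regularization weight (Gaussian priors on worker/example factors)
    """
    n_workers, n_examples = R.shape
    U = 0.1 * rng.standard_normal((n_workers, k))
    V = 0.1 * rng.standard_normal((n_examples, k))
    for _ in range(epochs):
        E = mask * (U @ V.T - R)          # reconstruction error on observed entries only
        U -= lr * (E @ V + lam * U)       # gradient step on worker factors
        V -= lr * (E.T @ U + lam * V)     # gradient step on example factors
    return U @ V.T                        # dense, PMF-inferred label matrix

# Toy usage: 10 workers, 20 examples, ~30% of entries observed, 20% label noise.
true_labels = rng.integers(0, 2, size=20)
noisy = (true_labels[None, :] ^ (rng.random((10, 20)) < 0.2)).astype(float)
mask = rng.random((10, 20)) < 0.3
R = noisy * mask

completed = pmf_complete(R, mask)
# Unsupervised consensus: threshold the per-example mean over the completed matrix.
consensus = (completed.mean(axis=0) > 0.5).astype(int)
print("agreement with truth:", (consensus == true_labels).mean())

In the supervised settings evaluated in the paper, one could instead use each worker's accuracy on known gold labels to weight the votes or to select which workers' rows contribute to the consensus; the thresholded column mean shown here is only the simplest unsupervised baseline.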
