Active Learning for Crowd-Sourced Databases

Crowd-sourcing has become a popular means of acquiring labeled data for a wide variety of tasks where humans are more accurate than computers, e.g., labeling images, matching objects, or analyzing sentiment. However, relying solely on the crowd is often impractical even for datasets with thousands of items, because acquiring human input costs both money and time (pennies and minutes per label). In this paper, we propose algorithms for integrating machine learning into crowd-sourced databases, with the goal of allowing crowd-sourcing applications to scale, i.e., to handle larger datasets at lower cost. The key observation is that, in many of the above tasks, humans and machine learning algorithms are complementary: humans are often more accurate but slow and expensive, while algorithms are usually less accurate but faster and cheaper. Based on this observation, we present two new active learning algorithms that combine humans and algorithms in a crowd-sourced database. Our algorithms are based on the theory of the non-parametric bootstrap, which makes our results applicable to a broad class of machine learning models. Our results, on three real-life datasets collected with Amazon's Mechanical Turk and on 15 well-known UCI datasets, show that our methods on average ask humans to label one to two orders of magnitude fewer items to achieve the same accuracy as a baseline that labels randomly chosen items, and ask two to eight times fewer questions than previous active learning schemes.
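The general idea of using the non-parametric bootstrap to decide which items to send to the crowd can be illustrated with a minimal sketch. This is not the paper's algorithm: it is a generic bootstrap-based uncertainty ranking, written under the assumptions that labels are binary (0/1), features are one-dimensional, and the model is a toy nearest-centroid classifier. All names (`fit_centroids`, `bootstrap_uncertainty`, `pick_for_crowd`) are hypothetical and chosen only for illustration.

```python
import random

def fit_centroids(xs, ys):
    """Toy 1-D nearest-centroid classifier (an assumption, not the paper's model).
    Returns a predict(x) function that outputs a 0/1 label."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    centroids = {y: sum(v) / len(v) for y, v in groups.items()}

    def predict(x):
        # If a resample contains only one class, that class is always predicted.
        return min(centroids, key=lambda y: abs(x - centroids[y]))

    return predict

def bootstrap_uncertainty(xs, ys, pool, fit=fit_centroids, k=50, seed=0):
    """Score each unlabeled item in `pool` by how much k bootstrap-trained
    models disagree on it. The score is p * (1 - p), where p is the fraction
    of models that predict label 1; it ranges from 0 to 0.25."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [0] * len(pool)
    for _ in range(k):
        # Non-parametric bootstrap: resample the labeled set with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        predict = fit([xs[i] for i in idx], [ys[i] for i in idx])
        for j, x in enumerate(pool):
            votes[j] += predict(x)
    return [(v / k) * (1 - v / k) for v in votes]

def pick_for_crowd(xs, ys, pool, budget, **kwargs):
    """Return indices of the `budget` pool items with the highest disagreement;
    these would go to human labelers, the rest to the classifier."""
    scores = bootstrap_uncertainty(xs, ys, pool, **kwargs)
    ranked = sorted(range(len(pool)), key=lambda j: scores[j], reverse=True)
    return ranked[:budget]

# Hypothetical data: six labeled items and three unlabeled ones.
labeled_x = [0.0, 0.2, 0.4, 1.6, 1.8, 2.0]
labeled_y = [0, 0, 0, 1, 1, 1]
unlabeled = [-5.0, 1.0, 7.0]
to_crowd = pick_for_crowd(labeled_x, labeled_y, unlabeled, budget=1)
```

Because the ranking only needs a `fit` function that returns a predictor, any other classifier could be substituted for the toy one, which is the sense in which a bootstrap-based score is model-agnostic. In this example the item near the class boundary (`1.0`) is the one the resampled models disagree on most, so it is the one selected for human labeling.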
