Active learning with confidence-based answers for crowdsourcing labeling tasks

Abstract: Collecting labels for data is important for many practical applications (e.g., data mining). However, this process can be expensive and time-consuming because it requires extensive effort from domain experts. To reduce the cost, many recent works acquire labeled datasets by combining crowdsourcing, which outsources labeling tasks (usually in the form of questions) to a large group of non-expert workers, with active learning, which actively selects the best instances to be labeled. For difficult tasks where workers are uncertain about their answers, however, asking for discrete labels can yield low-quality labels and therefore poor performance. In this paper, we design questions that elicit continuous worker responses, which are more informative because they carry both a worker's label and the worker's confidence in it. Because crowd workers may make mistakes, multiple workers are hired to answer each question, and we propose a new aggregation method to integrate their responses. Taking workers' confidence into account improves the accuracy of the integrated labels. Furthermore, based on these new answers, we propose a novel active learning framework that iteratively selects instances for “labeling”. We define a score function for instance selection that combines the uncertainty derived from the classifier model with the uncertainty derived from the answer sets. The uncertainty derived from confidence-based answers is more effective than that derived from discrete labels. We also propose batch methods that select multiple instances at a time to further improve the efficiency of our approach. Experimental studies on both simulated and real data show that our methods are effective in increasing labeling accuracy and achieve significantly better performance than existing methods.
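The abstract does not give the aggregation rule or the score function, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's method. It assumes a binary task in which each worker response is a number in [0, 1] (above 0.5 means the positive label, and distance from 0.5 expresses confidence), a confidence-weighted mean as the aggregate, binary entropy as the uncertainty measure, and a hypothetical mixing weight `alpha`. The names `aggregate`, `binary_entropy`, and `selection_score` are illustrative.

```python
import math


def aggregate(responses):
    """Confidence-weighted mean of continuous responses in [0, 1].

    Assumption: a response r encodes the label (r > 0.5 is positive) and
    the confidence 2 * |r - 0.5|. This weighting is illustrative only.
    """
    weights = [2 * abs(r - 0.5) for r in responses]
    total = sum(weights)
    if total == 0:
        # Every worker is completely unsure, so the aggregate is too.
        return 0.5
    return sum(w * r for w, r in zip(weights, responses)) / total


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def selection_score(model_prob, responses, alpha=0.5):
    """Mix classifier uncertainty with answer-set uncertainty.

    model_prob: the classifier's predicted probability of the positive class.
    responses:  worker responses collected so far (may be empty).
    alpha:      hypothetical mixing weight between the two uncertainties.
    """
    # An instance with no answers yet is treated as maximally uncertain.
    answer_unc = binary_entropy(aggregate(responses)) if responses else 1.0
    return alpha * binary_entropy(model_prob) + (1 - alpha) * answer_unc


# Two confident "positive" answers outweigh one hesitant "negative" answer.
integrated = aggregate([0.9, 0.8, 0.4])
label = 1 if integrated > 0.5 else 0
```

Under these assumptions, an instance whose classifier output is near 0.5 and whose workers hover near 0.5 receives the highest score, so it would be selected first for further answers.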
