Active Learning with c-Certainty

It is well known that label noise degrades the performance of active learning. To reduce this noise, approaches that query multiple oracles have been proposed; however, none of them can guarantee label quality. In addition, most previous works assume that the noise level of oracles is evenly distributed or example-independent, which may not be realistic. In this paper, we propose a novel active learning paradigm in which oracles return both labels and confidences. Under this paradigm, we propose a new and effective active learning strategy that guarantees label quality by querying multiple oracles. Furthermore, we remove the assumptions of the previous works mentioned above and design a novel algorithm that selects the best oracles to query. Our empirical study shows that the new algorithm is robust and performs well with different types of oracles. To the best of our knowledge, this is the first work to propose this active learning paradigm together with an active learning algorithm in which label quality is guaranteed.
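The core idea — querying multiple confidence-reporting oracles until a label reaches certainty level c — can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: it assumes binary labels, oracles that report a self-assessed probability of being correct, and a naive conditionally-independent (log-odds) combination rule; the function and parameter names are hypothetical.

```python
import math

def query_until_c_certain(x, oracles, c=0.9):
    """Query oracles for example x until the combined certainty of the
    inferred label reaches c, or all oracles have been queried.

    Each oracle is a callable returning (label, confidence), where the
    label is +1 or -1 and confidence in (0.5, 1.0) is the oracle's
    self-reported probability of being correct. Confidences are combined
    on the log-odds scale under a conditional-independence assumption.
    """
    log_odds = 0.0      # accumulated log odds in favor of label +1
    certainty = 0.5     # certainty of the current majority label
    answers = []
    for oracle in oracles:
        label, conf = oracle(x)
        conf = min(max(conf, 0.501), 0.999)  # keep log-odds finite
        delta = math.log(conf / (1.0 - conf))
        log_odds += delta if label == +1 else -delta
        answers.append((label, conf))
        # Certainty of whichever label the combined evidence favors.
        certainty = 1.0 / (1.0 + math.exp(-abs(log_odds)))
        if certainty >= c:
            break  # label quality is now guaranteed at level c
    final_label = +1 if log_odds >= 0 else -1
    return final_label, certainty, answers
```

Under this sketch, a single 0.9-confidence answer gives certainty 0.9, so reaching c = 0.95 requires corroboration from a second oracle; disagreeing oracles cancel each other out and force further queries, which mirrors the paper's motivation for selecting which oracles to ask.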
