What if the Irresponsible Teachers Are Dominating? A Method of Training on Samples and Clustering on Teachers

As Internet-based crowdsourcing services grow in popularity, learning from multiple teachers or sources has received increasing attention in the machine learning community. In this setting, the learning system deals with samples and labels provided by multiple teachers who are, in most cases, non-experts. Their labeling styles and behaviors are usually diverse, and some are even detrimental to the learning system. Simply pooling their labels and applying algorithms designed for the single-teacher scenario is therefore not only improper but potentially damaging; the problem calls for more specific methods. Our work focuses on the case where the teachers consist of good ones and irresponsible ones. By irresponsible, we mean a teacher who does not take the labeling task seriously and labels samples at random without inspecting them. This behavior is quite common when the task is not attractive enough and the teacher simply wants to finish it as quickly as possible. Irresponsible teachers can make up a considerable fraction, or even a majority, of all teachers, and if their influence is not removed, the learning system will undoubtedly be corrupted. In this paper, we propose a method for picking out the good teachers, with promising experimental results: it works even when the irresponsible teachers dominate in number.
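The abstract does not spell out the clustering procedure, but the intuition behind clustering on teachers can be illustrated with a small simulation: responsible teachers agree with one another far more often than random labelers agree with anyone, so an agreement matrix over teachers separates the two groups even when the irresponsible ones are the majority. The sketch below is a hedged toy under assumed parameters (3 good teachers at 95% accuracy, 7 uniform-random ones, 500 binary samples), not the paper's actual method; the one-eigenvector spectral cut is only one plausible way to split the agreement matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500
true_labels = rng.integers(0, 2, n_samples)

# Assumed setup: 3 good teachers (95% accurate) and 7 irresponsible
# teachers (uniform random) -- the irresponsible ones dominate in number.
def good_teacher():
    flip = rng.random(n_samples) < 0.05
    return np.where(flip, 1 - true_labels, true_labels)

def irresponsible_teacher():
    return rng.integers(0, 2, n_samples)

labels = np.array([good_teacher() for _ in range(3)] +
                  [irresponsible_teacher() for _ in range(7)])

# Pairwise agreement rate between teachers: ~0.9 among good teachers,
# ~0.5 for any pair involving a random labeler.
agree = (labels[:, None, :] == labels[None, :, :]).mean(axis=2)

# Split teachers into two groups by the sign of the leading eigenvector
# of the mean-centered agreement matrix (a single spectral cut).
A = agree - agree.mean()
_, v = np.linalg.eigh(A)
cut = v[:, -1] > 0
groups = [np.where(cut)[0], np.where(~cut)[0]]

# Keep the group with the highest internal agreement: the good teachers.
def internal_agreement(g):
    if len(g) < 2:
        return 0.0
    sub = agree[np.ix_(g, g)]
    return (sub.sum() - len(g)) / (len(g) * (len(g) - 1))

good = max(groups, key=internal_agreement)
print(sorted(good))  # -> [0, 1, 2], the good teachers' indices
```

Picking the cluster by internal agreement, rather than by size, is what makes the split robust to a majority of irresponsible teachers: random labelers agree with each other only at chance level, so their cluster can never look more coherent than the good one.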
