Identifying and Mitigating Labelling Errors in Active Learning

Most existing active learning methods for classification assume that the observed labels (i.e., those given by a human labeller) are perfectly correct. However, in real-world applications the labeller is usually subject to labelling errors that reduce the classification accuracy of the learned model. In this paper, we address this issue for active learning in the streaming setting and try to answer the following questions: (1) which labelled instances are most likely to be mislabelled? (2) is it always beneficial to abstain from learning when data is suspected to be mislabelled? (3) which mislabelled instances require relabelling? We propose a hybrid active learning strategy based on two measures. The first measure filters potentially mislabelled instances based on the degree of disagreement between the manually given label and the predicted class label. The second measure selects for relabelling only the most informative instances that deserve to be corrected: an instance is worth relabelling if it shows highly conflicting information between the predicted and the queried labels. Experiments on several real-world datasets show that filtering mislabelled instances according to the first measure and relabelling a few instances selected according to the second measure greatly improves the classification accuracy of stream-based active learning.
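The two measures described above can be sketched in code. This is a minimal illustration under assumptions, not the paper's exact formulation: the function names, the disagreement score (one minus the predicted probability of the given label), the conflict score (a small margin between the two most probable classes), the filter threshold, and the relabelling budget are all choices made for this sketch.

```python
import numpy as np

def mislabel_disagreement(proba, given_labels):
    # Measure 1 (assumed form): disagreement between the human-given label
    # and the model's predicted class distribution. A high score means the
    # model assigns little probability to the given label, so the instance
    # is a candidate for being mislabelled.
    return 1.0 - proba[np.arange(len(given_labels)), given_labels]

def relabel_informativeness(proba):
    # Measure 2 (assumed form): how conflicting the model's own prediction
    # is, here one minus the margin between the two most probable classes.
    # A small margin (high conflict) marks an instance worth relabelling.
    part = np.sort(proba, axis=1)
    return 1.0 - (part[:, -1] - part[:, -2])

def filter_and_select(proba, given_labels, filter_threshold=0.5, relabel_budget=1):
    # Filter suspects by measure 1, then spend the relabelling budget on
    # the most informative suspects by measure 2. Non-selected suspects are
    # discarded (abstention); everything else is kept for learning.
    n = len(given_labels)
    suspects = np.where(mislabel_disagreement(proba, given_labels) > filter_threshold)[0]
    info = relabel_informativeness(proba[suspects])
    to_relabel = suspects[np.argsort(-info)[:relabel_budget]]
    discarded = np.setdiff1d(suspects, to_relabel)
    kept = np.setdiff1d(np.arange(n), discarded)
    return kept, to_relabel
```

In a streaming setting these scores would be computed per arriving batch, with `proba` coming from the current model's class-probability estimates (e.g. a `predict_proba`-style output).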
