Asking for a second opinion: Re-querying of noisy multi-class labels

In this paper, we propose a new maximum-margin-based active learning algorithm for identifying incorrectly labeled training data. The algorithm combines a round-robin approach for investigating each class with a simple yet effective ranking metric called maximum negative margin (MNM). The highest-ranked samples are given to an expert for re-evaluation to determine whether they are indeed mislabeled. We also adapt five active learning metrics to the noisy-label task, including uncertainty sampling with margin sampling (USMS) and minimum margin, which have previously been used in the standard active learning setting to identify new samples to label. USMS is very competitive with maximum negative margin. In addition, we consider other information-theoretic criteria for this new task, including uncertainty sampling with entropy, query-by-committee with voting entropy, and K-nearest neighbor with voting entropy, but these consistently perform worse than MNM and USMS. The MNM noisy-label active learning algorithm can be useful in several scenarios, including data cleansing as a preprocessing step before training and identifying mislabeled examples in the test set.
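The core idea can be sketched in a few lines. Assuming a classifier that outputs per-class scores (e.g. probabilities), a sample's margin is its labeled-class score minus the best competing-class score; the most negative margins are the strongest mislabeling candidates, and the round-robin step cycles through the classes so no single class dominates the expert's queue. The function and variable names below are illustrative, not the paper's code:

```python
def mnm_scores(proba, labels):
    """Margin of each sample: labeled-class score minus the highest
    competing-class score. More negative = more likely mislabeled."""
    margins = []
    for p, y in zip(proba, labels):
        competing = max(p[c] for c in range(len(p)) if c != y)
        margins.append(p[y] - competing)
    return margins

def round_robin_queries(margins, labels, n_classes, budget):
    """Round-robin over classes: repeatedly take the not-yet-queried
    sample with the most negative margin from each class in turn,
    until the expert's query budget is exhausted."""
    # Per class, indices sorted by ascending margin (most suspicious first).
    per_class = [sorted((i for i, y in enumerate(labels) if y == c),
                        key=lambda i: margins[i])
                 for c in range(n_classes)]
    ptr = [0] * n_classes
    order = []
    while len(order) < budget:
        progressed = False
        for c in range(n_classes):
            if len(order) >= budget:
                break
            if ptr[c] < len(per_class[c]):
                order.append(per_class[c][ptr[c]])
                ptr[c] += 1
                progressed = True
        if not progressed:  # every class exhausted
            break
    return order
```

The samples returned by `round_robin_queries` would then be shown to the expert for re-labeling; this is a sketch of the ranking and scheduling logic only, not of any particular classifier.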
