Classifier self-assessment: active learning and active noise correction for document classification

This paper introduces two novel techniques that improve document classification while reducing the user's manual labeling effort. The first technique uses uncertainty sampling as the selection criterion for batch-mode active learning, so that only the most informative documents are suggested for manual labeling. This yields a steep improvement even for small training sets and addresses the problem of creating and improving an initial training set. The second technique, active noise correction, cleans an existing large set of weakly labeled documents: the classifier's self-assessment is used to detect mislabeled documents, which are then relabeled. Two approaches to active noise correction are explored: one in which a human expert corrects the flagged labels, and one in which the assigned labels are corrected automatically.
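Both techniques rest on the classifier's own confidence estimates, and a minimal sketch may make the idea concrete. The abstract does not say which uncertainty measure or confidence threshold is used, so everything below is an assumption made for illustration: Shannon entropy as the uncertainty score, a threshold of 0.9, and the hypothetical function names `select_batch` and `flag_suspected_noise`. The sketch also assumes the classifier outputs one probability per class for each document.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_batch(pool_probs, batch_size):
    """Return indices of the batch_size most uncertain unlabeled documents.

    Assumption: uncertainty is measured by entropy; the paper may use a
    different measure.
    """
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:batch_size]

def flag_suspected_noise(probs, labels, threshold=0.9):
    """Return (index, predicted_label) pairs for suspected label errors.

    A document is flagged when the classifier disagrees with the assigned
    label and is at least `threshold` confident in its own prediction.
    Assumption: the threshold value is illustrative only.
    """
    flagged = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        pred = max(range(len(p)), key=lambda c: p[c])
        if pred != y and p[pred] >= threshold:
            flagged.append((i, pred))
    return flagged

# Active learning: pick the 2 documents the classifier is least sure about.
pool_probs = [[0.98, 0.02], [0.55, 0.45], [0.70, 0.30], [0.50, 0.50]]
batch = select_batch(pool_probs, 2)

# Noise correction: find weak labels the classifier confidently contradicts.
weak_probs = [[0.95, 0.05], [0.10, 0.90], [0.60, 0.40]]
weak_labels = [1, 1, 1]
suspects = flag_suspected_noise(weak_probs, weak_labels)
```

In the expert-based variant, the flagged indices would be shown to a human for review. In the automatic variant, the assigned label would simply be replaced by the predicted label returned with each flagged index.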
