An Intrinsic Stopping Criterion for Committee-Based Active Learning

As supervised machine learning methods are increasingly used in language technology, the need for high-quality annotated language data becomes pressing. Active learning (AL) is a means of alleviating the annotation burden. This paper addresses the problem of knowing when to stop the AL process without requiring the human annotator to make an explicit decision on the matter. We propose and evaluate an intrinsic stopping criterion for committee-based AL of named entity recognizers.
