A Little Annotation does a Lot of Good: A Study in Bootstrapping Low-resource Named Entity Recognizers

Most state-of-the-art models for named entity recognition (NER) rely on the availability of large amounts of labeled data, making them challenging to extend to new, lower-resourced languages. However, several approaches have recently been proposed that involve either cross-lingual transfer learning, which learns from other highly resourced languages, or active learning, which efficiently selects effective training data based on model predictions. This paper poses the question: given this recent progress, and limited human annotation, what is the most effective method for efficiently creating high-quality entity recognizers in under-resourced languages? Based on extensive experimentation using both simulated and real human annotation, we find that a dual-strategy approach works best: starting with a cross-lingual transferred model, then performing targeted annotation of only uncertain entity spans in the target language, minimizing annotator effort. Results demonstrate that cross-lingual transfer is a powerful tool when very little data can be annotated, but an entity-targeted annotation strategy can achieve competitive accuracy quickly, with just one-tenth of the training data.
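The entity-targeted annotation strategy described above can be sketched roughly as follows: rank the spans a transferred model predicts as entities by its confidence, and send only the least-confident ones to a human annotator. This is a minimal illustrative sketch, not the paper's actual implementation; the span tuples, confidence scores, and function name are assumptions made for the example.

```python
def select_uncertain_spans(predicted_spans, budget):
    """Pick the `budget` predicted entity spans the model is least sure about.

    predicted_spans: list of (sentence_id, start, end, confidence) tuples,
    where confidence is the model's probability for its predicted label.
    Returns the spans to route to a human annotator.
    """
    # Lowest-confidence spans first: these are where annotation helps most.
    ranked = sorted(predicted_spans, key=lambda span: span[3])
    return ranked[:budget]


if __name__ == "__main__":
    # Illustrative model outputs (sentence_id, start, end, confidence).
    spans = [
        (0, 2, 4, 0.95),  # high confidence: keep the model's label
        (0, 7, 8, 0.41),  # low confidence: annotate
        (1, 0, 2, 0.63),
        (2, 5, 6, 0.30),  # lowest confidence: annotate first
    ]
    for span in select_uncertain_spans(spans, budget=2):
        print(span)
```

Only the annotated low-confidence spans need new labels; the remaining tokens keep the transferred model's predictions, which is what keeps annotator effort down to a fraction of full-sentence labeling.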
