A Little Annotation does a Lot of Good: A Study in Bootstrapping Low-resource Named Entity Recognizers

Most state-of-the-art models for named entity recognition (NER) rely on the availability of large amounts of labeled data, making them challenging to extend to new, lower-resourced languages. However, several approaches have recently been proposed that involve either cross-lingual transfer learning, which learns from other highly resourced languages, or active learning, which efficiently selects effective training data based on model predictions. This paper poses the question: given this recent progress, and limited human annotation, what is the most effective method for efficiently creating high-quality entity recognizers in under-resourced languages? Based on extensive experimentation using both simulated and real human annotation, we find that a dual-strategy approach works best: starting with a cross-lingual transferred model, then performing targeted annotation of only uncertain entity spans in the target language, minimizing annotator effort. Results demonstrate that cross-lingual transfer is a powerful tool when very little data can be annotated, but an entity-targeted annotation strategy can achieve competitive accuracy quickly, with just one-tenth of the training data.
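The entity-targeted annotation strategy described above can be sketched roughly as follows: rank the spans a transferred model predicts as entities by its confidence, and send only the least-confident ones to a human annotator. This is a minimal illustrative sketch, not the paper's actual implementation; the span tuples, confidence scores, and function name are assumptions made for the example.

```python
def select_uncertain_spans(predicted_spans, budget):
    """Pick the `budget` predicted entity spans the model is least sure about.

    predicted_spans: list of (sentence_id, start, end, confidence) tuples,
    where confidence is the model's probability for its predicted label.
    Returns the spans to route to a human annotator.
    """
    # Lowest-confidence spans first: these are where annotation helps most.
    ranked = sorted(predicted_spans, key=lambda span: span[3])
    return ranked[:budget]


if __name__ == "__main__":
    # Illustrative model outputs (sentence_id, start, end, confidence).
    spans = [
        (0, 2, 4, 0.95),  # high confidence: keep the model's label
        (0, 7, 8, 0.41),  # low confidence: annotate
        (1, 0, 2, 0.63),
        (2, 5, 6, 0.30),  # lowest confidence: annotate first
    ]
    for span in select_uncertain_spans(spans, budget=2):
        print(span)
```

Only the annotated low-confidence spans need new labels; the remaining tokens keep the transferred model's predictions, which is what keeps annotator effort down to a fraction of full-sentence labeling.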
