Building Low-Resource NER Models Using Non-Speaker Annotation

In low-resource natural language processing (NLP), the key problem is a lack of training data in the target language. Cross-lingual methods have had notable success in addressing this concern, but in certain common circumstances, such as insufficient pre-training corpora or languages distant from the source language, their performance suffers. In this work we propose an alternative approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations, provided by annotators with no prior experience in the target language. We recruit 30 participants to annotate unfamiliar languages in a carefully controlled annotation experiment, using Indonesian, Russian, and Hindi as target languages. Our results show that non-speaker annotators produce results that approach or match the performance of fluent speakers. NS results are also consistently on par with or better than those of cross-lingual methods built on modern contextual representations, and have the potential to outperform them given additional annotation effort. We conclude with observations of common annotation practices and recommendations for maximizing non-speaker annotator performance.
