Cheap Translation for Cross-Lingual Named Entity Recognition

Recent work in NLP has attempted to address low-resource languages but has still assumed a level of resources that most languages lack, e.g., the availability of Wikipedia in the target language. We propose a simple method for cross-lingual named entity recognition (NER) that works well in settings with minimal resources. Our approach uses a lexicon to “translate” annotated data available in one or several high-resource languages into the target language, and then learns a standard monolingual NER model on the translated data. Further, when Wikipedia is available in the target language, our method can enhance Wikipedia-based methods to yield state-of-the-art NER results; we evaluate on 7 diverse languages, improving the state of the art by an average of 5.5 F1 points. Because it requires so few resources, the approach is highly portable, as we illustrate on a truly low-resource language, Uyghur.
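
The core idea of lexicon-based "translation" can be sketched in a few lines. The snippet below is only an illustration under simplifying assumptions, not the paper's actual procedure: it substitutes words one at a time, copies any word missing from the lexicon, and carries each NER tag over unchanged. The function name `translate_sentence`, the toy English-to-Spanish lexicon, and the example sentence are all hypothetical.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, NER tag)


def translate_sentence(sentence: List[Token], lexicon: Dict[str, str]) -> List[Token]:
    """Replace each source word with its lexicon entry, keeping its NER tag.

    Assumption: one-to-one word substitution. Words absent from the
    lexicon (often names) are copied through unchanged.
    """
    translated = []
    for word, tag in sentence:
        target = lexicon.get(word.lower(), word)  # fall back to the source word
        translated.append((target, tag))
    return translated


# Hypothetical toy lexicon and annotated source sentence.
toy_lexicon = {"lives": "vive", "in": "en"}
english = [("Maria", "B-PER"), ("lives", "O"), ("in", "O"), ("Madrid", "B-LOC")]

spanish = translate_sentence(english, toy_lexicon)
for word, tag in spanish:
    print(f"{word}\t{tag}")
```

The translated, still-annotated sentences would then serve as training data for an ordinary monolingual NER model in the target language. A real lexicon typically offers several candidate translations per word and multi-word entries, which this sketch does not handle.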
