Accurate Unsupervised Joint Named-Entity Extraction from Unaligned Parallel Text

We present a new approach to named-entity recognition that jointly learns to identify named entities in parallel text. The system generates seed candidates through local, cross-language edit likelihood and then bootstraps to make broad predictions across both languages, optimizing combined contextual, word-shape and alignment models. It is completely unsupervised: it uses no manually labeled items and no external resources, only parallel text, which does not need to be easily alignable. The results are strong, with F > 0.85 for purely unsupervised named-entity recognition across languages, compared to just F = 0.35 on the same data for supervised cross-domain named-entity recognition within a language. Combining the unsupervised and supervised methods increases accuracy to F = 0.88. We conclude that this is a viable new strategy for unsupervised named-entity recognition across low-resource languages and for domain adaptation within high-resource languages.
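The seed-generation step can be illustrated with a minimal sketch. It is not the system's actual model: it substitutes a plain normalized Levenshtein similarity for the cross-language edit likelihood, and the function names, the threshold of 0.7, the minimum token length of 4, and the example sentence pair are all assumptions made for illustration. The idea it shows is that tokens which survive translation nearly unchanged are good candidates for named entities.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - edit_distance(a, b) / longest


def seed_candidates(source_tokens, target_tokens, threshold=0.7, min_length=4):
    """Return (source, target, score) triples for close cross-language matches.

    threshold and min_length are illustrative assumptions, not tuned values.
    """
    seeds = []
    for s, t in product(source_tokens, target_tokens):
        if min(len(s), len(t)) < min_length:
            continue  # very short tokens match by chance too easily
        score = similarity(s, t)
        if score >= threshold:
            seeds.append((s, t, score))
    return sorted(seeds, key=lambda triple: -triple[2])


# Hypothetical parallel sentence pair, used only to exercise the functions.
source = "Lopital Sacre-Coeur ki nan vil Milot".split()
target = "Sacre-Coeur Hospital which is located in the village of Milot".split()
for s, t, score in seed_candidates(source, target):
    print(f"{s} <-> {t}: {score:.2f}")
```

In a full system, seeds like these would only initialize the bootstrapping stage; the contextual, word-shape and alignment models would then generalize from them to entities that are not near-identical across the two languages.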
