Building a Multilingual Lexical Resource for Named Entity Disambiguation, Translation and Transliteration

In this paper, we present HeiNER, the multilingual Heidelberg Named Entity Resource. HeiNER contains 1,547,586 disambiguated English Named Entities (NEs) together with translations and transliterations into 15 languages. Our work builds on the approach of Bunescu and Pasca (2006) and extends it to a multilingual dimension. Named Entities are translated into the target languages by exploiting crosslingual information contained in the online encyclopedia Wikipedia. In addition, HeiNER provides linguistic contexts for every NE in all target languages, which makes it a valuable resource for multilingual Named Entity Recognition, Disambiguation and Classification. Evaluated against the assessments of human annotators, the NEs we extract from the English Wikipedia reach a high precision of 0.95. These source-language NEs are thus very reliable seeds for our multilingual NE translation method.

[1] Tao Tao, et al. Named Entity Transliteration with Comparable Corpora, 2006, ACL.

[2] Razvan C. Bunescu, et al. Using Encyclopedic Knowledge for Named Entity Disambiguation, 2006, EACL.

[3] Razvan C. Bunescu, et al. Learning for information extraction: from named entity recognition and disambiguation to relation extraction, 2007.

[4] Erik F. Tjong Kim Sang, et al. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, 2003, CoNLL.

[5] Sanjeev Khudanpur, et al. Transliteration of Proper Names in Cross-Lingual Information Retrieval, 2003, NER@ACL.

[6] Christopher D. Manning, et al. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling, 2005, ACL.

[7] Rada Mihalcea, et al. Using Wikipedia for Automatic Word Sense Disambiguation, 2007, NAACL.

[8] Nerea Ezeiza, et al. Named Entities Translation Based on Comparable Corpora, 2006, Workshop On Multi-Word-Expressions In A Multilingual Context.

[9] Jacob Cohen. A Coefficient of Agreement for Nominal Scales, 1960.

[10] Anne-Laure Ligozat, et al. Evaluation and Improvement of Cross-Lingual Question Answering Strategies, 2006.

[11] Max Mühlhäuser, et al. Analyzing and accessing Wikipedia as a lexical semantic resource, 2007.

[12] Stan Matwin, et al. Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity, 2006, Canadian AI.

[13] Sergei Nirenburg, et al. The Proper Place of Men and Machines in Language Translation, 2003.

[14] Simone Paolo Ponzetto, et al. Deriving a Large-Scale Taxonomy from Wikipedia, 2007, AAAI.

[15] Martin Kay. The Proper Place of Men and Machines in Language Translation, 2004, Machine Translation.

[16] Maarten de Rijke, et al. Finding Similar Sentences across Multiple Languages in Wikipedia, 2006.

[17] Eduard H. Hovy, et al. The Automated Acquisition of Topic Signatures for Text Summarization, 2000, COLING.

[18] Daniel M. Dunlavy, et al. Semisupervised Named Entity Recognition, 2009.

[19] M. de Rijke, et al. Discovering missing links in Wikipedia, 2005, LinkKDD '05.

[20] Simone Paolo Ponzetto, et al. WikiRelate! Computing Semantic Relatedness Using Wikipedia, 2006, AAAI.

[21] Andreas Eisele. First Steps towards Multi-Engine Machine Translation, 2005, ParallelText@ACL.

[22] J. Fleiss. Measuring nominal scale agreement among many raters, 1971.

[23] Wei Gao, et al. Phoneme-Based Transliteration of Foreign Names for OOV Problem, 2004, IJCNLP.