Mining English-Chinese Named Entity Pairs from Comparable Corpora

Bilingual Named Entity (NE) pairs are valuable resources for many NLP applications. Because comparable corpora are more accessible, abundant and up-to-date than parallel corpora, recent research has concentrated on mining bilingual lexicons from them. This work presents a novel approach to mining English-Chinese NE translations from comparable corpora. For every candidate NE pair, it combines features drawn from several information sources: the transliteration model, English-Chinese matching, Chinese-English matching, the translation model, length, and the context vector. These features are integrated into a single model by linear combination, trained with the minimum sample risk (MSR) algorithm. Because NE translation is highly type-dependent, we integrate different features for different NE types. We experiment with individual and integrated features to mine person NE (PN) pairs, location NE (LN) pairs and organization NE (ON) pairs. For PN pairs, the best performance, an F-score of 84.9%, is achieved with the transliteration and length features. For LN pairs, the best performance, an F-score of 83.4%, is achieved with the transliteration model, length, translation model, English-Chinese matching and Chinese-English matching features. For ON pairs, the best performance, an F-score of 84.1%, is achieved with the English-Chinese matching and Chinese-English matching features.
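
The linear combination of features described above can be illustrated with a minimal sketch. Only the per-type feature subsets come from the abstract; the function names, feature keys, weight values and example scores are hypothetical. In the paper the weights are learned with MSR, which is not implemented here.

```python
from typing import Dict

# Feature subsets per NE type follow the abstract; the weight values are
# made-up placeholders (the paper learns them with MSR, not shown here).
WEIGHTS_BY_TYPE: Dict[str, Dict[str, float]] = {
    "PN": {"transliteration": 0.8, "length": 0.2},
    "ON": {"en_zh_matching": 0.5, "zh_en_matching": 0.5},
}

def combined_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum over the features selected for one NE type.

    Features missing from `features` contribute 0.
    """
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

def best_candidate(candidates: Dict[str, Dict[str, float]],
                   weights: Dict[str, float]) -> str:
    """Return the candidate translation with the highest combined score."""
    return max(candidates, key=lambda c: combined_score(candidates[c], weights))

# Hypothetical feature scores for two Chinese candidates of one English person name.
candidates = {
    "candidate_a": {"transliteration": 0.9, "length": 0.5},
    "candidate_b": {"transliteration": 0.4, "length": 1.0},
}
top = best_candidate(candidates, WEIGHTS_BY_TYPE["PN"])
```

Keeping a separate weight table per NE type mirrors the type-dependent integration in the abstract: a feature that is absent from a type's table simply does not contribute to that type's score.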
