Entity Translation Mining from Comparable Corpora: Combining Graph Mapping with Corpus Latent Features

This paper addresses the problem of mining named entity translations from comparable corpora, specifically English-Chinese named entity translations. We first observe that existing approaches use one or more of the following named entity similarity metrics: entity, entity context, and relationship. Motivated by this observation, we propose a new holistic approach that 1) combines all of these similarity types and 2) additionally considers relationship context similarity between pairs of named entities, a missing quadrant in the taxonomy of similarity metrics. We abstract the named entity translation problem as the matching of two named entity graphs extracted from the comparable corpora. Specifically, named entity graphs are first constructed from the comparable corpora to capture relationships between named entities. Entity similarity and entity context similarity are then calculated for every pair of bilingual named entities. A reinforcing method is then used to reflect relationship similarity and relationship context similarity between named entities. We also discover "latent" features lost in the graph extraction process and integrate them into our framework. According to our experimental results, our holistic graph-based approach and its enhancement using corpus latent features are highly effective, and our framework significantly outperforms previous approaches.
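
The graph-matching idea in the abstract can be illustrated with a small sketch. It is not the paper's actual formulation: the update rule, the damping weight `lam`, the function name `reinforce`, and the toy entities and scores are all assumptions made for illustration. The sketch starts from an initial similarity for every bilingual entity pair (standing in for entity and entity context similarity) and repeatedly mixes in the average score of neighboring pairs, so that two entities become more similar when their related entities are also similar.

```python
from itertools import product


def reinforce(sim, en_graph, zh_graph, lam=0.5, iterations=10):
    """Propagate similarity between neighboring entity pairs.

    sim      -- dict: (en_entity, zh_entity) -> initial similarity in [0, 1]
    en_graph -- dict: English entity -> set of related English entities
    zh_graph -- dict: Chinese entity -> set of related Chinese entities
    lam      -- hypothetical weight given to neighbor support (assumption)
    """
    scores = dict(sim)
    for _ in range(iterations):
        updated = {}
        for (e, c), base in sim.items():
            # Every pairing of a neighbor of e with a neighbor of c.
            pairs = list(product(en_graph.get(e, ()), zh_graph.get(c, ())))
            # Average current score of those pairs; 0.0 if either side
            # has no neighbors, so isolated pairs receive no support.
            support = (
                sum(scores.get(p, 0.0) for p in pairs) / len(pairs)
                if pairs else 0.0
            )
            updated[(e, c)] = (1 - lam) * base + lam * support
        scores = updated
    return scores


# Toy data (hypothetical): A-B are related in English, X-Y in Chinese.
en_graph = {"A": {"B"}, "B": {"A"}, "C": set()}
zh_graph = {"X": {"Y"}, "Y": {"X"}, "Z": set()}

# Initial similarities: B is equally similar to Y and Z before reinforcement.
sim = {pair: 0.1 for pair in product("ABC", "XYZ")}
sim[("A", "X")] = 0.9
sim[("B", "Y")] = 0.5
sim[("B", "Z")] = 0.5

scores = reinforce(sim, en_graph, zh_graph)

# Pick the highest-scoring Chinese candidate for each English entity.
best = {e: max("XYZ", key=lambda c: scores[(e, c)]) for e in "ABC"}
```

In this toy setup, B starts out equally similar to Y and Z, but only the pair (B, Y) has a strongly matching neighbor pair (A, X), so reinforcement ranks Y above Z for B. Averaging over all neighbor pairs is one simple choice; other normalizations would fit the same description.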
