Translating Chinese Romanized Name into Chinese Idiographic Characters via Corpus and Web Validation

Cross-language information retrieval performance depends on the quality of the translation resources used to pass from a user's source-language query to target-language documents. Translation lists of proper names are rare but vital resources for cross-language retrieval between languages that use different character sets. Named-entity translation dictionaries can be extracted from bilingual corpora with some degree of success, but the limited coverage of these scarce corpora remains a problem. In this article, we present a technique for finding Chinese transliterations for any Chinese name written in English script. Our system performs transliteration of Pinyin (the standard Romanization for Chinese) to Chinese characters via corpus and web validation. Though Chinese family names form a small set, the number and variety of multisyllabic first names is great, and treatment is complicated by the fact that one Pinyin transliteration can correspond to hundreds of different Chinese characters. Our method finds the best translations of a Chinese name written in Pinyin by filtering out unlikely translations using a bigram model derived from a very large monolingual Chinese corpus, and then vetting the remaining candidate transliterations using Web statistics. We experimentally validate our method using an independent gold standard.
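The pipeline described in the abstract (candidate generation from Pinyin, bigram filtering against a monolingual corpus, then ranking with Web statistics) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the syllable table, the bigram counts, and the web hit counts are small invented stand-ins, the function names are hypothetical, and the count threshold used for filtering is one plausible reading of "filtering out unlikely translations".

```python
from itertools import product

# Toy syllable-to-character table (illustrative only; a real table maps
# each Pinyin syllable to many more characters).
PINYIN_TO_CHARS = {
    "wang": ["王", "汪"],
    "xiao": ["小", "晓"],
    "ming": ["明", "名"],
}

# Hypothetical bigram counts standing in for a large monolingual corpus.
BIGRAM_COUNTS = {
    ("王", "小"): 120, ("王", "晓"): 95, ("汪", "小"): 8,
    ("小", "明"): 300, ("晓", "明"): 210, ("小", "名"): 40,
}

# Hypothetical web hit counts standing in for search-engine statistics.
WEB_HITS = {"王小明": 50000, "王晓明": 32000, "王小名": 15}


def candidates(syllables):
    """Enumerate every character sequence the syllables could spell."""
    options = [PINYIN_TO_CHARS[s] for s in syllables]
    return ["".join(chars) for chars in product(*options)]


def bigram_filter(names, min_count=1):
    """Keep names whose adjacent character pairs all reach min_count."""
    return [
        n for n in names
        if all(BIGRAM_COUNTS.get((a, b), 0) >= min_count
               for a, b in zip(n, n[1:]))
    ]


def rank_by_web(names):
    """Order surviving candidates by (stubbed) web frequency, highest first."""
    return sorted(names, key=lambda n: WEB_HITS.get(n, 0), reverse=True)


def transliterate(syllables):
    """Run the three stages: generate, filter by bigrams, rank by web hits."""
    return rank_by_web(bigram_filter(candidates(syllables)))
```

Calling `transliterate(["wang", "xiao", "ming"])` enumerates eight candidates, discards those containing a character pair absent from the toy bigram table, and orders the rest by the stubbed hit counts. In a real system the two lookup tables would be replaced by corpus-derived bigram statistics and live Web queries.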
