Mining the Web for Transliteration Lexicons: Joint-Validation Approach

The Web is the largest available data collection and reflects language use in daily life. With the advent of new technology and the flood of information on the Web, it has become common to coin new terms for new concepts and to render these terms in languages written in non-Latin scripts by transliteration, that is, "translation by sound". Cross-language natural language processing applications, such as machine translation and cross-language information retrieval, usually need a translation dictionary, which affects the quality of these applications. However, transliterated terms are usually not registered in the translation dictionary. To address this problem, we present a transliteration lexicon acquisition model that mines the Web for transliteration lexicons. In this paper, we describe techniques for comparing phonetic similarity to recognize transliteration pair candidates on the Web and for finding the correct transliteration pairs based on joint-validation. The techniques were evaluated against manually constructed transliteration lexicons. Our experiments revealed that the techniques effectively found transliteration lexicons on the Web.
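As a rough illustration of the candidate-recognition step, the sketch below scores a source term against romanized candidate strings and keeps those above a threshold. It is not the similarity measure used in this paper, which the abstract does not specify: it assumes that candidates have already been romanized, uses normalized edit distance over letters as a stand-in for a true phonetic comparison, and the function names and the 0.6 threshold are hypothetical. The joint-validation step is not sketched because the abstract does not define it.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phonetic_similarity(source: str, romanized: str) -> float:
    """Similarity in [0, 1]; letters stand in for phonemes (an assumption)."""
    a, b = source.lower(), romanized.lower()
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def candidate_pairs(term, romanized_candidates, threshold=0.6):
    """Return (candidate, score) pairs at or above the hypothetical threshold,
    best first."""
    scored = [(c, phonetic_similarity(term, c)) for c in romanized_candidates]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

A real system would compare phoneme sequences rather than letters, so that sound-alike spellings score higher than this letter-level proxy allows.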
