Harvesting Regional Transliteration Variants with Guided Search

This paper proposes a method for harvesting regional transliteration variants with guided search. We first study how to incorporate transliteration knowledge into query formulation so that searches are significantly more likely to return the desired transliterations. We then study a cross-training algorithm, which exploits information across different regional corpora when learning transliteration models and thereby improves overall extraction performance. The experimental results show that the proposed method not only effectively harvests a lexicon of regional transliteration variants but also reduces the need for manual data labeling in transliteration modeling. We also examine the underlying characteristics of regional transliterations that motivate the cross-training algorithm.
