An Effective and Robust Framework for Transliteration Exploration

Transliteration is the process of proper name translation based on pronunciation. It is an important process in many multilingual natural language tasks. A common and essential component of transliteration approaches is a verification mechanism that tests if the two names in different languages are translations of each other. Although many transliteration systems have verification as a component, verification as a stand-alone problem is relatively new. In this paper, we propose a simple, effective and robust training framework for the task of verification. We show the many applications of the verification techniques. Our proposed method can operate on both phonemic and orthographic inputs. Our best results show that a simple, straightforward orthographic representation is sufficient and no complex training method is needed. It is effective because it achieves remarkable accuracies. It is robust because it is language-independent. We show that on Chinese and Korean our technique achieves equal error rate well below 1% and around 1% for Japanese using 2009 and 2010 NEWS transliteration generation share task dataset. Our results also show that the orthographic system outperforms the phonemic system. This is especially encouraging because the orthographic inputs are easier to generate and secondly, one does not need to resort to more complex training algorithm to achieve excellent results. This approach is integrated for proper name based cross lingual information retrieval without translation. 1

[1]  Long Jiang,et al.  Named Entity Translation with Web Mining and Transliteration , 2007, IJCAI.

[2]  Jian Su,et al.  A Joint Source-Channel Model for Machine Transliteration , 2004, ACL.

[3]  Noriko Kando,et al.  Overview of the NTCIR-7 ACLIA IR4QA Task , 2008, NTCIR.

[4]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[5]  Lalit R. Bahl,et al.  A Maximum Likelihood Approach to Continuous Speech Recognition , 1983, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Kevin Knight,et al.  Machine Transliteration , 1997, CL.

[7]  Sanjeev Khudanpur,et al.  Transliteration of Proper Names in Cross-Lingual Information Retrieval , 2003, NER@ACL.

[8]  Haizhou Li,et al.  Report of NEWS 2010 Transliteration Mining Shared Task , 2010, NEWS@ACL.

[9]  Berlin Chen,et al.  Generating phonetic cognates to handle named entities in English-Chinese cross-language spoken document retrieval , 2001, IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01..

[10]  Haizhou Li,et al.  A phonetic similarity model for automatic extraction of transliteration pairs , 2007, TALIP.

[11]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[12]  Alvin F. Martin,et al.  The DET curve in assessment of detection task performance , 1997, EUROSPEECH.

[13]  Salim Roukos,et al.  A Maximum Entropy Word Aligner for Arabic-English Machine Translation , 2005, HLT.

[14]  Nello Cristianini,et al.  Classification using String Kernels , 2000 .

[15]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.

[16]  Ea-Ee Jan,et al.  Transliteration Retrieval Model for Cross Lingual Information Retrieval , 2010, AIRS.

[17]  Bhuvana Ramabhadran,et al.  Multilingual Spoken Term Detection: Finding and Testing New Pronunciations , 2008 .

[18]  Tao Tao,et al.  Named Entity Transliteration with Comparable Corpora , 2006, ACL.

[19]  Yaser Al-Onaizan,et al.  Translating Named Entities Using Monolingual and Bilingual Resources , 2002, ACL.

[20]  Haizhou Li,et al.  Report of NEWS 2010 Transliteration Generation Shared Task , 2010, NEWS@ACL.

[21]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.