Machine transliteration survey

Machine transliteration is the process of automatically transforming the script of a word from a source language to a target language while preserving its pronunciation. Algorithms designed specifically for machine transliteration first appeared over a decade ago and were based on the phonetics of the source and target languages; approaches using statistical and language-specific methods followed. In this survey, we review the key methodologies introduced in the transliteration literature. We categorize the approaches by the resources and algorithms they use and compare their effectiveness.
