Leveraging supplementary transcriptions and transliterations via re-ranking

Grapheme-to-phoneme conversion (G2P) and machine transliteration are important tasks in natural language processing. Supplemental data can often help resolve difficult ambiguities: existing transliterations of the same word can help choose among a G2P system’s candidate output transcriptions; similarly, transliterations from other languages can help choose among candidate transliterations in a given language. Transcriptions can be leveraged in this way as well. In this thesis, I investigate the problem of applying supplemental data to improve G2P and machine transliteration results. I present a unified method for leveraging related transliteration or transcription data to improve the performance of a base G2P or machine transliteration system. My approach constructs features with the supplemental data, which are then used in an SVM re-ranker. This re-ranking approach is shown to work across multiple base systems and achieves error reductions ranging from 8% to 43% over state-of-the-art base systems in cases where supplemental
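The re-ranking idea described above can be illustrated with a minimal sketch. It is not the thesis implementation: the feature set (base-system score plus string similarity to supplemental transliterations), the function names, the toy data, and the hand-set weights are all assumptions made for illustration. In the approach described, the weights would be learned by an SVM re-ranker rather than fixed by hand, and the features would presumably be richer than a plain string-similarity ratio.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1]; a stand-in for richer alignment-based features."""
    return SequenceMatcher(None, a, b).ratio()


def features(candidate: str, base_score: float, supplements: list[str]) -> list[float]:
    """Feature vector: base-system score, best and mean similarity to the supplemental data."""
    sims = [similarity(candidate, s) for s in supplements]
    best = max(sims, default=0.0)
    mean = sum(sims) / len(sims) if sims else 0.0
    return [base_score, best, mean]


def rerank(nbest: list[tuple[str, float]], supplements: list[str],
           weights: list[float]) -> list[tuple[str, float]]:
    """Sort the base system's n-best list by a linear score over the features."""
    def score(item: tuple[str, float]) -> float:
        candidate, base_score = item
        feats = features(candidate, base_score, supplements)
        return sum(w * f for w, f in zip(weights, feats))
    return sorted(nbest, key=score, reverse=True)


# Hypothetical n-best output of a base G2P system: (candidate, base score).
nbest = [("gajz", 0.55), ("gis", 0.45)]
# Hypothetical hand-set weights; a trained SVM re-ranker would supply these.
weights = [1.0, 0.5, 0.5]

# A romanized transliteration of the same word from another language (toy data).
print(rerank(nbest, ["gisu"], weights)[0][0])
# Without supplemental data, the base system's ordering is kept.
print(rerank(nbest, [], weights)[0][0])
```

In this toy case the supplemental transliteration promotes the second candidate over the base system's first choice, while an empty supplement list leaves the original ranking unchanged.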
