Leveraging supplementary transcriptions and transliterations via re-ranking

Grapheme-to-phoneme conversion (G2P) and machine transliteration are important tasks in natural language processing. Supplemental data can often help resolve difficult ambiguities: existing transliterations of the same word can help choose among a G2P system’s candidate output transcriptions; similarly, transliterations from other languages can help choose among candidate transliterations in a given language. Transcriptions can be leveraged in this way as well. In this thesis, I investigate the problem of applying supplemental data to improve G2P and machine transliteration results. I present a unified method for leveraging related transliteration or transcription data to improve the performance of a base G2P or machine transliteration system. My approach constructs features with the supplemental data, which are then used in an SVM re-ranker. This re-ranking approach is shown to work across multiple base systems and achieves error reductions ranging from 8% to 43% over state-of-the-art base systems in cases where supplemental
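The re-ranking idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the function names `similarity`, `features`, and `rerank` are hypothetical, the similarity feature is a plain character-overlap ratio from Python's `difflib`, and the weights are set by hand rather than learned by an SVM re-ranker. Real supplemental data in another script or in a phonemic alphabet would need an alignment-based similarity instead of direct string comparison.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (illustrative stand-in feature)."""
    return SequenceMatcher(None, a, b).ratio()


def features(candidate: str, base_score: float,
             supplemental: Sequence[str]) -> List[float]:
    """Feature vector: base system score, then best similarity to any
    supplemental string (0.0 when no supplemental data is available)."""
    best = max((similarity(candidate, s) for s in supplemental), default=0.0)
    return [base_score, best]


def rerank(nbest: Sequence[Tuple[str, float]],
           supplemental: Sequence[str],
           weights: Sequence[float]) -> List[Tuple[str, float]]:
    """Sort (candidate, base_score) pairs by weighted feature sum, best first.

    The weights are hand-set here; a trained re-ranker would learn them.
    """
    def score(item: Tuple[str, float]) -> float:
        cand, base = item
        feats = features(cand, base, supplemental)
        return sum(w * f for w, f in zip(weights, feats))

    return sorted(nbest, key=score, reverse=True)


# Toy n-best list: the base system slightly prefers "abc" over "abd".
toy_nbest = [("abc", 0.6), ("abd", 0.4)]
# Toy supplemental string that agrees with the second candidate.
toy_supplemental = ["abd"]
reranked = rerank(toy_nbest, toy_supplemental, weights=[1.0, 1.0])
```

With the toy inputs above, the supplemental string agrees with the lower-scored candidate, so the combined score promotes it to the top. When no supplemental data is supplied, the similarity feature is zero for every candidate and the base system's order is unchanged.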

Aditya Bhargava

[1] Peter N. Yianilos, et al. Learning String-Edit Distance, 1996, IEEE Trans. Pattern Anal. Mach. Intell.

[2] F. Rosenblatt, et al. The perceptron: a probabilistic model for information storage and organization in the brain, 1958, Psychological Review.

[3] Grzegorz Kondrak, et al. How do you pronounce your name? Improving G2P with transliterations, 2011, ACL.

[4] Hermann Ney, et al. Joint-sequence models for grapheme-to-phoneme conversion, 2008, Speech Commun.

[5] Pushpak Bhattacharyya, et al. Everybody loves a rich cousin: An empirical study of transliteration through bridge languages, 2010, NAACL.

[6] Thomas Niesler, et al. Data-driven phonetic comparison and conversion between South African, British and American English pronunciations, 2009, INTERSPEECH.

[7] Haizhou Li, et al. Report of NEWS 2010 Transliteration Mining Shared Task, 2010, NEWS@ACL.

[8] Regina Barzilay, et al. Adding More Languages Improves Unsupervised Multilingual Part-of-Speech Tagging: a Bayesian Non-Parametric Approach, 2009, NAACL.

[9] Sittichai Jiampojamarn, et al. Grapheme-to-phoneme conversion and its application to transliteration, 2011.

[10] Philipp Koehn, et al. Improved Statistical Machine Translation Using Paraphrases, 2006, NAACL.

[11] Susan Fitt, et al. Robust LTS rules with the Combilex speech technology lexicon, 2009, INTERSPEECH.

[12] Mi-Young Kim, et al. Transliteration Generation and Mining with Limited Training Resources, 2010, NEWS@ACL.

[13] Grzegorz Kondrak, et al. Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion, 2007, NAACL.

[14] James H. Martin, et al. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2000.

[15] R. H. Baayen, et al. The CELEX Lexical Database (CD-ROM), 1996.

[16] Monojit Choudhury, et al. A Diachronic Approach for Schwa Deletion in Indo Aryan Languages, 2004, SIGMORPHON@ACL.

[17] Hai Zhao, et al. Reranking with Multiple Features for Better Transliteration, 2010, NEWS@ACL.

[18] Grzegorz Kondrak, et al. A New Algorithm for the Alignment of Phonetic Sequences, 2000, ANLP.

[19] Grzegorz Kondrak, et al. Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion, 2008, ACL.

[20] Haizhou Li, et al. Whitepaper of NEWS 2009 Machine Transliteration Shared Task, 2009, NEWS@IJCNLP.

[21] Na'im R. Tyson, et al. Prosodic rules for schwa-deletion in Hindi text-to-speech synthesis, 2009, Int. J. Speech Technol.

[22] Chih-Jen Lin, et al. LIBSVM: A library for support vector machines, 2011, TIST.

[23] Haizhou Li, et al. Report of NEWS 2010 Transliteration Generation Shared Task, 2010, NEWS@ACL.

[24] Hitoshi Isahara, et al. A Comparison of Pivot Methods for Phrase-Based Statistical Machine Translation, 2007, NAACL.

[25] Jean-Pierre Martens, et al. G2P conversion of names: what can we do (better)?, 2007, INTERSPEECH.

[26] Mirella Lapata, et al. Machine Translation by Triangulation: Making Effective Use of Multi-Parallel Corpora, 2007, ACL.

[27] Grzegorz Kondrak, et al. Integrating Joint n-gram Features into a Discriminative Training Framework, 2010, HLT-NAACL.

[28] Grzegorz Kondrak, et al. DirecTL: a Language Independent Approach to Transliteration, 2009, NEWS@IJCNLP.

[29] Grzegorz Kondrak, et al. Letter-Phoneme Alignment: An Exploration, 2010, ACL.

[30] Haizhou Li, et al. Machine Transliteration: Leveraging on Third Languages, 2010, COLING.

[31] Giuseppe Riccardi, et al. Computing consensus translation from multiple machine translation systems, 2001, IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '01).

[32] Chih-Jen Lin, et al. LIBLINEAR: A Library for Large Linear Classification, 2008, J. Mach. Learn. Res.

[33] Alan W. Black, et al. Issues in building general letter to sound rules, 1998, SSW.

[34] Reinhard Kneser, et al. Designing very compact decision trees for grapheme-to-phoneme transcription, 2001, INTERSPEECH.

[35] Eiichiro Sumita, et al. Transliteration Using a Phrase-Based Statistical Machine Translation System to Re-Score the Output of a Joint Multigram Model, 2010, NEWS@ACL.

[36] Grzegorz Kondrak, et al. Language identification of names with SVMs, 2010, HLT-NAACL.

[37] Thorsten Joachims, et al. Optimizing search engines using clickthrough data, 2002, KDD.

[38] Thorsten Joachims, et al. Training linear SVMs in linear time, 2006, KDD '06.

[39] Alan W. Black, et al. Learning Pronunciation Dictionaries: Language Complexity and Word Selection Strategies, 2006, NAACL.

[40] Hua Wu, et al. Revisiting Pivot Language Approach for Machine Translation, 2009, ACL.

[41] Qian Yang, et al. Development of a phoneme-to-phoneme (p2p) converter to improve the grapheme-to-phoneme (g2p) conversion of names, 2006, LREC.