Solving the Phoneme Conflict in Grapheme-to-Phoneme Conversion Using a Two-Stage Neural Network-Based Approach

SUMMARY To achieve high-quality output in speech synthesis systems, data-driven grapheme-to-phoneme (G2P) conversion is usually used to generate the phonetic transcription of out-of-vocabulary (OOV) words. To improve the performance of G2P conversion, this paper addresses the problem of conflicting phonemes, where an input grapheme in the same context can map to several different output phonemes. To this end, we propose a two-stage neural network-based approach that converts the input text to phoneme sequences in the first stage and then predicts each output phoneme in the second stage using the phonemic information obtained. The first-stage neural network is implemented as a many-to-many mapping model that automatically converts words to phoneme sequences, while the second stage uses a combination of the obtained phoneme sequences to predict the output phoneme corresponding to each input grapheme in a given word. We evaluate the performance of this approach on an American English word-based pronunciation dictionary, the auto-aligned CMUDict corpus [1]. In comparisons against several baseline approaches, the evaluation results show that the proposed approach improves on the previous one-stage neural network-based approach for G2P conversion in terms of phoneme and word accuracy on OOV words. A comparison with another existing approach indicates that the proposed approach provides higher phoneme accuracy but lower word accuracy on a general dataset, and slightly higher phoneme and word accuracy on a selection of words consisting of more than one phoneme.
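
The notion of a phoneme conflict described in the summary can be illustrated with a minimal sketch. It is not the paper's method: it assumes one-to-one grapheme-phoneme alignments (the paper's first stage is many-to-many), a fixed symmetric context window, and toy data that is not taken from CMUDict. The function name `find_conflicts`, the padding symbol `#`, and the null phoneme `_` are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def find_conflicts(aligned_words, window=1):
    """Return grapheme contexts that were observed with more than one phoneme.

    aligned_words: iterable of (graphemes, phonemes) pairs of equal length,
    i.e. an assumed one-to-one alignment where "_" marks a silent grapheme.
    """
    seen = defaultdict(set)
    for graphemes, phonemes in aligned_words:
        # Pad both ends so every grapheme has a full context window.
        padded = ["#"] * window + list(graphemes) + ["#"] * window
        for i, phoneme in enumerate(phonemes):
            context = tuple(padded[i:i + 2 * window + 1])
            seen[context].add(phoneme)
    # A conflict is a context mapped to two or more distinct phonemes.
    return {ctx: phs for ctx, phs in seen.items() if len(phs) > 1}


# Toy alignments (hypothetical, for illustration only).
toy = [
    (["r", "e", "a", "d"], ["R", "IY", "_", "D"]),  # "read", present tense
    (["r", "e", "a", "d"], ["R", "EH", "_", "D"]),  # "read", past tense
    (["r", "e", "d"], ["R", "EH", "D"]),            # "red"
]
conflicts = find_conflicts(toy, window=1)
print(conflicts)
```

In this toy data the grapheme "e" in the context ("r", "e", "a") is seen with both IY and EH, so that context is reported as a conflict, while "e" in ("r", "e", "d") is not. A wider window reduces such conflicts but cannot remove those caused by identical spellings.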

[1] Grzegorz Kondrak, et al. Online discriminative training for grapheme-to-phoneme conversion, 2009, INTERSPEECH.

[2] Seng Kheang, et al. Improving the performance of Letter-To-Phoneme conversion by using Two-Stage Neural Network, 2012.

[3] R. N. Indah. Language and Speech, 1958, Nature.

[4] Grzegorz Kondrak, et al. Automatic Syllabification with Structured SVMs for Letter-to-Phoneme Conversion, 2008, ACL.

[5] Keikichi Hirose, et al. Improving WFST-based G2P Conversion with Alignment Constraints and RNNLM N-best Rescoring, 2012, INTERSPEECH.

[6] Grzegorz Kondrak, et al. Letter-Phoneme Alignment: An Exploration, 2010, ACL.

[7] Grzegorz Kondrak, et al. Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion, 2007, NAACL.

[8] Enikö Beatrice Bilcu. Text-To-Phoneme Mapping Using Neural Networks, 2008.

[9] Vincent Claveau. Letter-to-phoneme conversion by inference of rewriting rules, 2009, INTERSPEECH.

[10] Robert I. Damper. Learning about speech from data: Beyond NETtalk, 2001.

[11] Tsuneo Nitta, et al. Letter-to-Phoneme Conversion Based on Two-Stage Neural Network Focusing on Letter and Phoneme Contexts, 2011, INTERSPEECH.

[12] Jaakko Astola, et al. Neural networks with random letter codes for text-to-phoneme mapping and small training dictionary, 2006, 2006 14th European Signal Processing Conference.

[13] Etienne Barnard, et al. Extracting pronunciation rules for phonemic variants, 2006.

[14] Paul Taylor, et al. Hidden Markov models for grapheme to phoneme conversion, 2005, INTERSPEECH.

[15] Etienne Barnard, et al. Pronunciation prediction with Default&Refine, 2008, Comput. Speech Lang.

[16] Hermann Ney, et al. Joint-sequence models for grapheme-to-phoneme conversion, 2008, Speech Commun.

[17] John A. Bullinaria. Text to phoneme alignment and mapping for speech technology: A neural networks approach, 2011, The 2011 International Joint Conference on Neural Networks.

[18] Paul C. Bagshaw. Phonemic transcription by analogy in text-to-speech synthesis: Novel word pronunciation and lexicon compression, 1998, Comput. Speech Lang.

[19] Florian Schiel, et al. Syllable-based text-to-phoneme conversion for German, 2000, INTERSPEECH.

[20] Julie Carson-Berndsen, et al. Hidden Markov models with context-sensitive observations for grapheme-to-phoneme conversion, 2010, INTERSPEECH.

[21] Hermann Ney, et al. Hidden Conditional Random Fields with M-to-N Alignments for Grapheme-to-Phoneme Conversion, 2012, INTERSPEECH.

[22] Anil Kumar Singh, et al. Modeling Letter-to-Phoneme Conversion as a Phrase Based Statistical Machine Translation Problem with Minimum Error Rate Training, 2009, HLT-NAACL.