A comparison of pronunciation modeling approaches for HMM-TTS

Hidden Markov model-based text-to-speech (HMM-TTS) systems are often trained on manual phonetic transcriptions of the voice corpus. Because these manual pronunciations cannot be predicted with complete accuracy at synthesis time, the result is a mismatch between training and synthesis. In this paper, an alternative approach is proposed in which a set of manually written post-lexical effects (PLE) rules, modeling a range of continuous speech effects, is applied to canonical lexicon pronunciations, and the resulting matched PLE phone sequences are used both in the voice corpus markup and at synthesis time. For a US English system, a subjective evaluation showed that a system trained on matched PLE markup and a system trained on manual phone markup were equally preferred. This suggests that it may be possible to replace manual pronunciations with matched PLE pronunciations, dramatically decreasing the time and cost required to produce an HMM-TTS voice.
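
The idea of applying PLE rules to canonical pronunciations can be illustrated with a minimal sketch. The rule shown here (intervocalic /t/ flapping), the ARPAbet-style phone symbols, and the names `flap_t`, `PLE_RULES`, and `apply_ple` are assumptions made for illustration only; they are not taken from the rule set used in this work, and a realistic flapping rule would also condition on stress. The point is that the same deterministic function can be run on the voice corpus text and on input text at synthesis time, so the two phone sequences match by construction.

```python
# Illustrative sketch only: the rule and the phone symbols are hypothetical,
# not the PLE rule set described in the paper.

# Assumed ARPAbet-style vowel inventory (simplified).
VOWELS = {"aa", "ae", "ah", "ao", "ax", "eh", "er", "ih", "iy", "uh", "uw"}


def flap_t(phones):
    """Hypothetical rule: rewrite /t/ between two vowels as a flap 'dx'.

    Simplification: stress is ignored, which a real rule would not do.
    """
    out = list(phones)
    for i in range(1, len(phones) - 1):
        if phones[i] == "t" and phones[i - 1] in VOWELS and phones[i + 1] in VOWELS:
            out[i] = "dx"
    return out


# Ordered list of rules; each maps a phone sequence to a phone sequence.
PLE_RULES = [flap_t]


def apply_ple(canonical):
    """Apply every rule in order to a canonical lexicon pronunciation."""
    phones = list(canonical)
    for rule in PLE_RULES:
        phones = rule(phones)
    return phones


# The same call would be used for corpus markup and at synthesis time.
canonical = ["b", "ah", "t", "er"]  # canonical lexicon entry for "butter"
matched = apply_ple(canonical)
```

Because the rules are ordered and deterministic, extending the sketch to further continuous speech effects would amount to appending functions to `PLE_RULES`.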