Improving TTS through higher agreement between predicted and observed pronunciations

This paper looks at improving unit selection text-to-speech (TTS) quality by optimizing the agreement between the front end and the speech database. We focused, in particular, on two classes of problems that degrade synthesis quality: 1) realization of /d/ and /t/ sounds and 2) confusion of unstressed vowels, especially with schwas. We investigated two approaches to tackling these problems. First, we improved the phonological processing in the front end modules. Further improvement came from creating speaker-dependent pronunciation lexicons for automatic speech labeling of our voice databases. This change alleviated many pronunciation errors caused by mismatches between lexical pronunciations and how the speaker (voice talent) actually pronounced a word, while keeping the labeling consistent. Each speaker has his or her own characteristic pronunciations (and context-dependent variations), so no single standard lexicon can cover every speaker's variants. A subjective listening test showed that combining the two approaches yielded a perceived quality improvement for American English male and female voices.
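To make the second approach concrete, here is a minimal sketch of how a speaker-dependent lexicon could be derived from forced-alignment output. The function name build_speaker_lexicon, the min_count threshold, and the ARPAbet-style phone symbols are illustrative assumptions rather than the paper's actual pipeline; the idea is simply to override a canonical lexicon entry whenever the speaker's recordings consistently show a different realization.

```python
from collections import Counter, defaultdict

def build_speaker_lexicon(aligned_words, standard_lexicon, min_count=3):
    """Derive a speaker-dependent lexicon from forced-alignment output.

    aligned_words: iterable of (word, phone_sequence) pairs observed in the
        speaker's recordings, e.g. ("data", ("d", "ey", "t", "ax")).
    standard_lexicon: dict mapping word -> canonical phone sequence.
    Returns a dict mapping word -> the pronunciation this speaker most
    often realizes, falling back to the standard entry for rare words.
    """
    # Tally every pronunciation variant actually observed per word.
    observed = defaultdict(Counter)
    for word, phones in aligned_words:
        observed[word][tuple(phones)] += 1

    # Start from the standard lexicon so uncovered words keep
    # their canonical pronunciations.
    speaker_lexicon = dict(standard_lexicon)
    for word, variants in observed.items():
        pron, count = variants.most_common(1)[0]
        # Only override the canonical entry when the speaker's variant
        # occurs often enough to be a stable habit, not a one-off.
        if count >= min_count:
            speaker_lexicon[word] = pron
    return speaker_lexicon

# Example: the talent habitually says "data" as /d ey t ax/ rather than
# the canonical /d ae t ax/, so the speaker lexicon adopts that variant.
standard = {"data": ("d", "ae", "t", "ax")}
aligned = [("data", ("d", "ey", "t", "ax"))] * 5
print(build_speaker_lexicon(aligned, standard)["data"])  # ('d', 'ey', 't', 'ax')
```

Keeping the standard lexicon as the fallback preserves labeling consistency: only words with sufficient speaker-specific evidence deviate from the canonical pronunciations.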