Paper information - Improving TTS by higher agreement between predicted versus observed pronunciations

This paper looks at improving unit selection text-to-speech (TTS) quality by optimizing the agreement between the front end and the speech database. We focused, in particular, on two classes of problems causing degradation in synthesis quality: 1) realization of /d/ and /t/ sounds and 2) confusions of unstressed vowels, especially with schwas. We investigated two approaches to tackling these problems. First, we improved the phonological processing in the front-end modules. Further improvement resulted from creating speaker-dependent pronunciation lexicons for automatic speech labeling of our voice databases. This change helped alleviate many pronunciation errors that resulted from mismatches between lexical pronunciations and how the speaker (voice talent) actually pronounced a word, while keeping labeling consistent. Each speaker has his or her own unique pronunciations (and context-dependent variations), so no single standard lexicon can cover all of the speakers' variations. A subjective listening test showed that combining these two approaches resulted in a perceived quality improvement for American English male and female voices.

Ann K. Syrdal | Yeon-Jun Kim | Matthias Jilka