Constructing stylistic synthesis databases from audio books

In this paper, we explore how to construct stylistic TTS databases from audio books in which a single storyteller performs multiple roles. The goal is to identify and build a set of speech corpora, each of which not only portrays a representative voice style performed by the speaker but also contains enough sentences to synthesize natural speech with a unit selection approach. We solve the problem in two steps. First, each role is represented by a Gaussian Mixture Model (GMM), and all speech data are partitioned into a number of voice style clusters under a criterion that maximizes the likelihood of all utterances with respect to the roles' speaker models. Second, the clusters are purified by pruning on both acoustic and prosodic measures. The resulting four voice styles are subjectively interpreted as Neutral, Young, Elder and Adult. Perceptual experiments show that the proposed approach can synthesize speech with recognizable voice styles, at an average identification rate of 72.5%, and that the synthesized speech sounds better than speech synthesized with utterances from a single role.
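The likelihood scoring behind the first step can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes diagonal-covariance GMMs whose parameters are already trained, and the role names, feature dimension, and numeric values are hypothetical. It shows only how one utterance (a list of feature frames) is scored against each role model and assigned to the highest-scoring one; the grouping of roles into voice style clusters and the pruning step are not shown.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood of an utterance under a GMM.

    frames: list of feature vectors, one per frame.
    gmm: list of (weight, mean, var) components.
    """
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(terms)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total

def assign_utterance(frames, role_models):
    """Return the role whose model gives the highest log-likelihood, plus all scores."""
    scores = {role: gmm_log_likelihood(frames, gmm)
              for role, gmm in role_models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical, hand-set role models over 2-dimensional features.
role_models = {
    "role_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "role_b": [(0.7, [5.0, 5.0], [1.0, 1.0]), (0.3, [6.0, 4.0], [1.0, 1.0])],
}

# Hypothetical utterance of three frames.
utterance = [[4.8, 5.1], [5.3, 4.7], [5.9, 4.2]]
best_role, scores = assign_utterance(utterance, role_models)
```

In practice the frames would be acoustic features extracted from the recordings, and the role models would be estimated from the utterances labeled with each role.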
