Conversational spontaneous speech synthesis using average voice model

This paper describes conversational spontaneous speech synthesis based on hidden Markov models (HMMs). To reduce the amount of data required for model training, we utilize an average-voice-based speech synthesis framework, which has been shown to be effective for synthesizing speech in an arbitrary speaker's voice from a small amount of training data. We examine several kinds of average voice models built from reading-style speech and/or conversation-style speech, and we also examine an appropriate utterance unit for conversational speech synthesis. Experimental results show that the proposed two-stage model adaptation method improves the quality of synthetic conversational speech.
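
The abstract does not spell out the two adaptation stages, so the sketch below is only a rough illustration of how such a pipeline is commonly organized in the average-voice framework: Gaussian mean vectors of the average voice model are mapped by an MLLR-style affine transform estimated from conversational-style data, and then by a second transform estimated from the target speaker's data. The transform values, feature dimensions, and names used here are placeholders, not the authors' method.

```python
import numpy as np

# Minimal sketch (assumed, not the paper's implementation) of two-stage
# linear-transform adaptation of an average voice model's Gaussian means.
rng = np.random.default_rng(0)

dim = 40        # acoustic feature dimension (illustrative value)
n_states = 5    # states of one context-dependent HMM/HSMM

# Mean vectors of the speaker-independent average voice model (placeholders;
# in practice these come from speaker-adaptive training on a multi-speaker corpus).
avg_means = rng.normal(size=(n_states, dim))

def apply_affine(means, A, b):
    """Apply an MLLR-style affine transform mu' = A mu + b to each mean vector."""
    return means @ A.T + b

# Stage 1 (assumed): style transform, estimated from multi-speaker
# conversation-style speech. Here a small random perturbation stands in
# for a transform estimated by an MLLR-family algorithm.
A_style = np.eye(dim) + 0.01 * rng.normal(size=(dim, dim))
b_style = 0.1 * rng.normal(size=dim)

# Stage 2 (assumed): speaker transform, estimated from a small amount of
# the target speaker's conversational speech.
A_spk = np.eye(dim) + 0.01 * rng.normal(size=(dim, dim))
b_spk = 0.1 * rng.normal(size=dim)

style_means = apply_affine(avg_means, A_style, b_style)
target_means = apply_affine(style_means, A_spk, b_spk)

print(target_means.shape)  # (5, 40): adapted means for one model
```

In a real system the transforms would be estimated from adaptation data by maximizing likelihood (e.g., an MLLR-family algorithm) and applied to covariances and duration models as well; the point of the sketch is only the composition of a style stage followed by a speaker stage.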
