Improving Mandarin Prosody Generation Using Alternative Smoothing Techniques

Prosody plays a vital role in conveying both communicative meaning and specific speaking styles in speech communication. In recent years, the Hidden Markov Model (HMM)-based speech synthesis system (HTS) has been developed successfully and can synthesize stable, smooth speech. However, the prosody of the synthesized speech suffers from over-smoothing, so a better prosodic model is required to restore the natural variability of the synthesized speech. This study proposes a hybrid method that alleviates the problem by combining statistical and template-based unit-selection methods. First, a two-level clustering approach is proposed to obtain representative prosodic patterns (denoted as codewords) of the hierarchical prosodic structure, which is modeled by a modified Fujisaki model. The prosodic codewords are then used to represent the prosody of each sentence in a parallel corpus consisting of a real speech corpus and its synthesized counterpart generated by the HTS. At synthesis time, the synthesized utterance is used as a query to retrieve the prosodic codewords of utterances in the synthesized corpus. The retrieved synthesized prosodic codewords are mapped to prosodic codewords of the real speech using linear mapping rules obtained from the parallel corpus. Prosodic codeword language models for the prosodic word and the prosodic phrase, respectively, are then employed to choose the optimal codeword sequence of the real speech. Finally, the most likely sequence of prosodic codewords is obtained based on a NURBS-based continuity measure and used to synthesize speech with natural prosody. Subjective and objective tests demonstrate that the proposed prosodic model substantially improves the naturalness of the intonation of the synthesized speech compared with that of the HMM-based method.
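The final selection step, choosing a codeword sequence that scores well under a codeword language model while remaining continuous across unit boundaries, can be illustrated with a small dynamic-programming search. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `select_codewords`, the bigram form of the language model, the `weight` parameter, and the toy data are all hypothetical, and the NURBS-based continuity measure is replaced here by a simple difference between the F0 value at the end of one codeword and the start of the next.

```python
import math


def select_codewords(candidates, bigram_logp, join_cost, weight=1.0):
    """Viterbi-style search over candidate codewords (illustrative only).

    candidates  : list of lists; candidates[i] holds the codeword ids
                  allowed for the i-th prosodic unit.
    bigram_logp : function (prev, cur) -> log-probability of cur after prev
                  (assumed bigram codeword language model).
    join_cost   : function (prev, cur) -> non-negative discontinuity cost
                  (stand-in for the NURBS-based continuity measure).
    weight      : hypothetical trade-off between the two terms.
    Returns (best_score, best_path).
    """
    if not candidates:
        return 0.0, []
    # best[c] = (score, path) of the best partial sequence ending in c
    best = {c: (0.0, [c]) for c in candidates[0]}
    for options in candidates[1:]:
        new_best = {}
        for cur in options:
            new_best[cur] = max(
                (
                    (score + bigram_logp(prev, cur)
                     - weight * join_cost(prev, cur), path + [cur])
                    for prev, (score, path) in best.items()
                ),
                key=lambda item: item[0],
            )
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Toy data (assumed): each codeword is summarized by its start and end F0 in Hz.
contours = {
    "a": (200.0, 180.0),
    "b": (220.0, 210.0),
    "c": (185.0, 160.0),
    "d": (240.0, 230.0),
}


def toy_bigram_logp(prev, cur):
    # Uniform toy language model: every transition has probability 0.5.
    return math.log(0.5)


def toy_join_cost(prev, cur):
    # F0 jump at the boundary, scaled to be comparable with log-probabilities.
    return abs(contours[prev][1] - contours[cur][0]) / 100.0


score, path = select_codewords(
    [["a", "b"], ["c", "d"]], toy_bigram_logp, toy_join_cost
)
print(path, round(score, 3))
```

With a uniform language model the search is decided by continuity alone, so the toy run picks the pair with the smallest F0 jump at the boundary. In a fuller system the language-model term would come from counts over codeword sequences in the real speech corpus, and the join cost from the continuity measure the abstract describes.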
