Modeling of Speech Parameter Sequence Considering Global Variance for HMM-Based Speech Synthesis

Speech technologies such as speech recognition and speech synthesis have many potential applications since speech is the main way in which most people communicate. In speech communication, various linguistic sounds are produced by controlling the configuration of the oral cavities to convey a message. The produced speech sounds vary over time and are significantly affected by coarticulation, so it is not straightforward to segment speech signals into the corresponding linguistic symbols. Moreover, the acoustics of speech vary even when the same words are uttered by the same speaker, owing to differences in the manner of speaking and in the articulatory organs. It is therefore essential to model them stochastically in speech processing. The hidden Markov model (HMM) is an effective framework for modeling the acoustics of speech, and its introduction has enabled significant progress in speech and language technologies. In particular, there have been numerous efforts to develop HMM-based acoustic modeling techniques for speech recognition, and continuous density HMMs are widely used in modern continuous speech recognition systems (Gales & Young (2008)).

Several approaches have also been proposed for applying HMM-based acoustic modeling techniques to speech synthesis technologies (Donovan & Woodland (1995); Huang et al. (1996)) such as Text-to-Speech (TTS), which synthesizes speech from a given text. Recently, HMM-based speech synthesis has been proposed (Yoshimura et al. (1999)) and has generated interest owing to its various attractive features, such as completely data-driven voice building, flexible voice quality control, speaker adaptation, small footprint, and so forth (Zen et al. (2009)).

The basic framework of HMM-based speech synthesis consists of training and synthesis processes. In the training process, speech parameters such as the spectral envelope and fundamental frequency (F0) are extracted from speech waveforms, and their time sequences are modeled by context-dependent phoneme HMMs. To model the dynamic characteristics of speech acoustics with HMMs, which assume piecewise constant statistics within a state and conditional independence of observations, a joint vector of static and dynamic features is usually used as the observation vector. In the synthesis process, a smoothly varying speech parameter trajectory is generated by maximizing the likelihood of a composite sentence HMM with respect to the static feature vector sequence, rather than the observation vector sequence including both static and dynamic features, subject to the constraint between static and dynamic features (Tokuda et al. (2000)). Finally, a vocoding technique is employed to reconstruct a speech waveform from the generated speech parameters.
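As a concrete illustration of the synthesis step, the following is a minimal NumPy sketch of maximum-likelihood parameter generation in the spirit of Tokuda et al. (2000). It assumes a single feature dimension, diagonal covariances, a first-order delta window of (-0.5, 0, 0.5), and simplified boundary handling; the names build_window_matrix and mlpg are illustrative rather than taken from any particular toolkit.

```python
import numpy as np

def build_window_matrix(T, delta_window=(-0.5, 0.0, 0.5)):
    # Map a static sequence c (length T, one dimension for clarity) to the
    # stacked static+delta observation sequence o = W c.  Rows alternate
    # between a static row (copies c[t]) and a delta row (first-order window).
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0  # static feature row
        for tau, coef in zip((-1, 0, 1), delta_window):
            if coef != 0.0 and 0 <= t + tau < T:
                W[2 * t + 1, t + tau] = coef  # delta feature row
    return W

def mlpg(means, variances, W):
    # Maximum-likelihood parameter generation for one feature dimension:
    # solve c = (W' P W)^{-1} W' P mu with P = diag(1 / variances), where
    # means and variances are the (2T,) static+delta statistics gathered
    # from the HMM state sequence (diagonal covariance assumed).
    precisions = 1.0 / variances
    A = W.T @ (precisions[:, None] * W)
    b = W.T @ (precisions * means)
    return np.linalg.solve(A, b)

# Toy usage: a 5-frame trajectory of a single mel-cepstral coefficient.
T = 5
W = build_window_matrix(T)
means = np.random.randn(2 * T)    # stand-in for state-output means
variances = np.full(2 * T, 0.1)   # stand-in for state-output variances
c = mlpg(means, variances, W)     # smooth static trajectory, shape (5,)
```

In practice the generation is performed jointly over all feature dimensions, and the smoothness of the resulting trajectory comes from the delta constraint encoded in W rather than from any explicit smoothing of the static features.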

[1] Heiga Zen, et al. Hidden Semi-Markov Model Based Speech Synthesis System, 2006.

[2] Keiichi Tokuda, et al. Speech parameter generation algorithms for HMM-based speech synthesis, 2000, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[3] Philip C. Woodland, et al. Improvements in an HMM-based speech synthesiser, 1995, EUROSPEECH.

[4] Heiga Zen, et al. Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences, 2007, Comput. Speech Lang.

[5] Heiga Zen, et al. Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis, 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[6] Keiichi Tokuda, et al. CELP coding based on mel-cepstral analysis, 1995, International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[7] Heiga Zen, et al. Statistical Parametric Speech Synthesis, 2007, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] Heiga Zen, et al. Details of the Nitech HMM-Based Speech Synthesis System for the Blizzard Challenge 2005, 2007, IEICE Trans. Inf. Syst.

[9] Shigeru Katagiri, et al. A large-scale Japanese speech database, 1990, ICSLP.

[10] K. Tokuda, et al. Speech parameter generation from HMM using dynamic features, 1995, International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

[11] Heiga Zen, et al. The Nitech-NAIST HMM-Based Speech Synthesis System for the Blizzard Challenge 2006, 2006, IEICE Trans. Inf. Syst.

[12] Keiichi Tokuda, et al. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis, 1999, EUROSPEECH.

[13] Mark J. F. Gales, et al. The Application of Hidden Markov Models in Speech Recognition, 2007, Found. Trends Signal Process.

[14] Li-Rong Dai, et al. Minimum generation error criterion considering global/local variance for HMM-based speech synthesis, 2008, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] Tomoki Toda, et al. Trajectory training considering global variance for HMM-based speech synthesis, 2009, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16] Keiichi Tokuda, et al. Mel-generalized cepstral analysis - a unified approach to speech spectral estimation, 1994, ICSLP.

[17] Alex Acero, et al. Whistler: a trainable text-to-speech system, 1996, Fourth International Conference on Spoken Language Processing (ICSLP '96).

[18] Keiichi Tokuda, et al. A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis, 2007, IEICE Trans. Inf. Syst.

[19] Keiichi Tokuda, et al. Multi-Space Probability Distribution HMM, 2002.

[20] Koichi Shinoda, et al. MDL-based context-dependent subword modeling for speech recognition, 2000.

[21] Tomoki Toda, et al. Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[22] Hideki Kawahara, et al. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds, 1999, Speech Commun.

[23] Satoshi Imai, et al. Cepstral analysis synthesis on the mel frequency scale, 1983, ICASSP.

[24] Ren-Hua Wang, et al. Minimum Generation Error Training for HMM-Based Speech Synthesis, 2006, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).