Integration of Spectral Feature Extraction and Modeling for HMM-Based Speech Synthesis

This paper proposes a novel approach for integrating spectral feature extraction and acoustic modeling in hidden Markov model (HMM) based speech synthesis. The statistical modeling of speech waveforms is typically divided into two component modules: a frame-by-frame feature extraction module and an acoustic modeling module. The feature extraction module uses statistical mel-cepstral analysis, and its objective function is the likelihood of the mel-cepstral coefficients for given speech waveforms. In the acoustic modeling module, the objective function is the likelihood of the model parameters for given mel-cepstral coefficients. Improving the performance of each component module is important for achieving higher-quality synthesized speech. However, the final objective of a speech synthesis system is to generate natural speech waveforms from given texts, and improving each component module separately does not always improve the quality of the synthesized speech. Ideally, therefore, all objective functions should be optimized under an integrated criterion that represents the subjective speech quality of human perception well. In this paper, we propose an approach that models speech waveforms directly and optimizes the final objective function. Experimental results show that the proposed method outperformed the conventional methods in both objective and subjective measures.

key words: integrative model, HMM-based speech synthesis, acoustic modeling, mel-cepstral analysis, trajectory HMM
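
The contrast between the two-stage and the integrated criteria can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the abstract: x denotes a speech waveform, c a mel-cepstral coefficient sequence, w the label sequence derived from the text, and \lambda the acoustic model parameters. The conventional pipeline solves two separate maximizations, whereas an integrated criterion of the kind described above would treat c as a latent variable and maximize the likelihood of the waveform itself:

\[
\begin{aligned}
\text{feature extraction:} \quad & \hat{c} = \arg\max_{c} \; p(x \mid c) \\
\text{acoustic modeling:} \quad & \hat{\lambda} = \arg\max_{\lambda} \; p(\hat{c} \mid w, \lambda) \\
\text{integrated (sketch):} \quad & \hat{\lambda} = \arg\max_{\lambda} \; p(x \mid w, \lambda)
  = \arg\max_{\lambda} \int p(x \mid c) \, p(c \mid w, \lambda) \, dc
\end{aligned}
\]

In this reading, the first two lines each optimize their own objective, so errors in \hat{c} are invisible to the acoustic model, while the last line ties both stages to a single waveform-level objective. The exact form used by the authors may differ.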
