Excitation modeling based on waveform interpolation for HMM-based speech synthesis

It is generally known that a well-designed excitation model yields high-quality speech in hidden Markov model (HMM)-based speech synthesis systems. This paper proposes a novel technique for generating excitation based on waveform interpolation (WI). To model the WI parameters, we apply a statistical method, principal component analysis (PCA). The parameters of the proposed excitation model can be easily combined with a conventional speech synthesis system under the HMM framework. In a number of experiments, the proposed method was found to generate more natural-sounding speech.

Index Terms: HMM-based speech synthesis, waveform interpolation, principal component analysis

In this paper, we propose a novel approach to excitation modeling under the waveform interpolation (WI) framework. To parameterize the excitation generation model, a characteristic waveform (CW) is extracted from each frame of the linear prediction (LP) residual signal. To derive a compact representation of each CW, we apply principal component analysis (PCA) to a collection of the extracted CWs. Once PCA is done, each CW can be compactly approximated as a linear combination of a few PCA basis vectors. The statistical distribution of the linear combination coefficients and their dynamics can be efficiently described by HMMs, whose parameters are estimated with the conventional HMM training procedure. Given a sentence to synthesize, the sequence of CWs is generated from the trained HMMs according to the maximum likelihood (ML) criterion. The WI algorithm enables a smooth transition between adjacent CWs, resulting in a more natural excitation signal.

The major advantages of the proposed technique are twofold. First, instead of using a fixed set of waveforms such as an impulse train and random noise, the proposed method finds CWs that represent the excitation waveforms from various kinds of modeling in the frequency domain. Second, the WI approach lets the excitation signal evolve smoothly, which may reduce audible artifacts in the synthesized speech. A number of speech synthesis experiments demonstrate that the proposed technique enhances the quality of the synthesized speech.
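The two building blocks described above, the PCA representation of CWs and the smooth transition between adjacent CWs, can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration: the CWs are synthetic mixtures of three cosine shapes rather than waveforms extracted from LP residuals, the function names are hypothetical, and plain linear blending stands in for the WI algorithm, which this text does not specify in detail. The HMM training and ML parameter generation stages are omitted.

```python
import numpy as np


def fit_pca(cws, n_components):
    """Return the mean CW and the top principal directions of a CW collection.

    cws has shape (n_frames, cw_length), one characteristic waveform per row.
    """
    mean = cws.mean(axis=0)
    # SVD of the centered data; the rows of vt are orthonormal principal directions.
    _, _, vt = np.linalg.svd(cws - mean, full_matrices=False)
    return mean, vt[:n_components]


def encode(cw, mean, basis):
    """Project one CW onto the basis to obtain its low-dimensional coefficients."""
    return basis @ (cw - mean)


def decode(coeffs, mean, basis):
    """Approximate a CW as the mean plus a linear combination of basis vectors."""
    return mean + coeffs @ basis


def interpolate_cws(cw_a, cw_b, n_steps):
    """Return n_steps waveforms blending linearly from cw_a to cw_b (endpoints included).

    Linear blending is a simplification chosen for this sketch, not the WI algorithm itself.
    """
    weights = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - weights) * cw_a + weights * cw_b


# Synthetic stand-in data: each CW mixes three cosine shapes, plus a little noise.
rng = np.random.default_rng(0)
cw_length, n_frames = 64, 200
phase = 2.0 * np.pi * np.arange(cw_length) / cw_length
shapes = np.stack([np.cos(k * phase) for k in (1, 2, 3)])
cws = rng.normal(size=(n_frames, 3)) @ shapes
cws += 0.01 * rng.normal(size=cws.shape)

mean, basis = fit_pca(cws, n_components=3)

# Each CW is now summarized by three coefficients instead of 64 samples.
coeffs = encode(cws[0], mean, basis)
approx = decode(coeffs, mean, basis)
abs_error = np.linalg.norm(approx - cws[0])

# Smooth transition between the reconstructions of two adjacent frames.
cw_a = decode(encode(cws[0], mean, basis), mean, basis)
cw_b = decode(encode(cws[1], mean, basis), mean, basis)
frames = interpolate_cws(cw_a, cw_b, n_steps=5)
```

In a full system, the coefficient vectors produced by `encode` would be the features modeled by the HMMs, and `decode` followed by the interpolation step would turn a generated coefficient sequence back into an excitation signal.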
