Paper information - DCT-Based Amplitude and Frequency Modulated Harmonic-Plus-Noise Modelling for Text-to-Speech Synthesis

We present a harmonic-plus-noise modelling (HNM) strategy for corpus-based text-to-speech (TTS) synthesis in which each phoneme is modelled as a whole, in contrast to the traditional frame-based approach. The pitch and amplitude trajectories of each phoneme are modelled with a low-order DCT expansion. The parameter analysis algorithm is largely aided and guided by the pitch contours and by the phonetic annotation and segmentation information that is available in any TTS system. The major advantages of our model are few parameter interpolation points during synthesis (one per phoneme), flexible time and pitch modifications, and a reduced number of model parameters, which favours low-bit-rate coding in TTS for embedded applications. Listening tests on TTS sentences have shown that very natural speech can be obtained despite the compactness of the signal representation.
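The idea of describing a per-phoneme trajectory with a low-order DCT expansion can be illustrated with a minimal sketch. This is not the authors' analysis algorithm: the function names `dct_fit` and `dct_eval`, the coefficient normalisation, and the toy pitch contour are all assumptions chosen for illustration. The sketch fits a few DCT-II coefficients to a sampled contour and then re-evaluates the contour at a different number of points, which is one simple way to see why such a representation lends itself to time-scale modification.

```python
import math


def dct_fit(samples, order):
    """Fit the first `order` DCT-II coefficients to a sampled trajectory.

    Hypothetical helper. Coefficients are normalised so that the
    trajectory on normalised time t in [0, 1] is
        f(t) = sum_k coeffs[k] * cos(pi * k * t),
    which makes coeffs[0] the mean of the samples.
    """
    n = len(samples)
    coeffs = []
    for k in range(order):
        weight = 1.0 if k == 0 else 2.0
        acc = sum(
            x * math.cos(math.pi * k * (i + 0.5) / n)
            for i, x in enumerate(samples)
        )
        coeffs.append(weight * acc / n)
    return coeffs


def dct_eval(coeffs, n_out):
    """Evaluate the fitted trajectory at `n_out` evenly spaced midpoints."""
    out = []
    for j in range(n_out):
        t = (j + 0.5) / n_out  # normalised time inside the phoneme
        out.append(
            sum(c * math.cos(math.pi * k * t) for k, c in enumerate(coeffs))
        )
    return out


# Toy pitch contour (Hz) for one phoneme: a smooth fall around 120 Hz.
# The numbers are made up for illustration only.
N_FRAMES = 20
contour = [
    120.0 + 10.0 * math.cos(math.pi * (i + 0.5) / N_FRAMES)
    for i in range(N_FRAMES)
]

coeffs = dct_fit(contour, order=3)      # three parameters for the phoneme
stretched = dct_eval(coeffs, n_out=30)  # same shape over 1.5x the duration
raised = dct_eval([1.1 * c for c in coeffs], n_out=N_FRAMES)  # pitch scaled by 1.1
```

In this sketch the duration change only affects where the cosine basis is sampled, and scaling every coefficient by a constant scales the whole contour, so neither modification requires frame-by-frame parameter interpolation.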

Hugo Van hamme | Werner Verhelst | Sufian Irhimeh | Kris Hermus | Jan De Moortel

[1] Gerald Matz, et al. Time-frequency-autoregressive random processes: modeling and fast parameter estimation, 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03).

[2] H. Van hamme, et al. Robust speech recognition using cepstral domain missing data techniques and noisy masks, 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[3] Laurent Girin, et al. Long term modeling of phase trajectories within the speech sinusoidal model framework, 2004, INTERSPEECH.

[4] Thomas F. Quatieri, et al. Speech analysis/Synthesis based on a sinusoidal representation, 1986, IEEE Trans. Acoust. Speech Signal Process.

[5] Werner Verhelst, et al. An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech, 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6] Laurent Girin, et al. Perceptually weighted long term modeling of sinusoidal speech amplitude trajectories, 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.

[7] Yannis Stylianou, et al. Applying the harmonic plus noise model in concatenative speech synthesis, 2001, IEEE Trans. Speech Audio Process.

[8] Hugo Van hamme, et al. Estimation of the Voicing Cut-Off Frequency Contour Based on a Cumulative Harmonicity Score, 2007, IEEE Signal Processing Letters.