Using a pitch-synchronous residual codebook for hybrid HMM/frame selection speech synthesis

This paper proposes a method to improve the quality delivered by statistical parametric speech synthesizers. For this, we use a codebook of pitch-synchronous residual frames, so as to construct a more realistic source signal. First a limited codebook of typical excitations is built from some training database. During the synthesis part, HMMs are used to generate filter and source coefficients. The latter coefficients contain both the pitch and a compact representation of target residual frames. The source signal is obtained by concatenating excitation frames picked up from the codebook, based on a selection criterion and taking target residual coefficients as input. Subjective results show a relevant improvement compared to the basic technique.

[1]  P. Mermelstein,et al.  Low-rate quantization of spectral information in a 4 kb/s pitch-synchronous CELP coder , 2000, 2000 IEEE Workshop on Speech Coding. Proceedings. Meeting the Challenges of the New Millennium (Cat. No.00EX421).

[2]  Luís C. Oliveira,et al.  Pitch-synchronous time-scaling for prosodic and voice quality transformations , 2005, INTERSPEECH.

[3]  Hideki Kawahara,et al.  Accurate vocal event detection method based on a fixed-point analysis of mapping from time to weighted average group delay , 2000, INTERSPEECH.

[4]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[5]  Luís C. Oliveira,et al.  Pitch-synchronous time-scaling for high-frequency excitation regeneration , 2005, INTERSPEECH.

[6]  Keiichi Tokuda,et al.  Multi-Space Probability Distribution HMM , 2002 .

[7]  Alan W. Black,et al.  Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[8]  Heiga Zen,et al.  The Nitech-NAIST HMM-Based Speech Synthesis System for the Blizzard Challenge 2006 , 2006, IEICE Trans. Inf. Syst..

[9]  J. Liljencrants,et al.  Dept. for Speech, Music and Hearing Quarterly Progress and Status Report a Four-parameter Model of Glottal Flow , 2022 .

[10]  Heiga Zen,et al.  An excitation model for HMM-based speech synthesis based on residual modeling , 2007, SSW.

[11]  H. Zen,et al.  An HMM-based speech synthesis system applied to English , 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis, 2002..

[12]  Joan Claudi Socoró,et al.  Linguistic and mixed excitation improvements on a HMM-based speech synthesis for Castilian Spanish , 2007, SSW.

[13]  Keiichi Tokuda,et al.  Mixed excitation for HMM-based speech synthesis , 2001, INTERSPEECH.