Complex cepstrum as phase information in statistical parametric speech synthesis

Statistical parametric synthesizers usually rely on a simplified model of speech production in which a minimum-phase filter is driven by a zero-phase or random-phase excitation signal. This simplification, however, ignores the natural mixed-phase characteristics of the speech signal. This paper addresses the issue by proposing the use of the complex cepstrum to model phase information in statistical parametric speech synthesizers. A frame-based complex cepstrum is calculated by interpolating pitch-synchronous magnitude and unwrapped phase spectra. The noncausal part of this frame-based complex cepstrum is then modeled as phase features in the statistical parametric synthesizer. At synthesis time, the generated phase parameters are used to derive the coefficients of a glottal filter. Experimental results show that the proposed approach effectively embeds phase information in the synthetic speech, resulting in close-to-natural waveforms and improved speech quality.
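
As a rough illustration of the analysis front end described above, the sketch below computes a complex cepstrum for a single windowed, pitch-synchronous frame and separates its causal and noncausal parts; the noncausal coefficients are the kind of phase features the synthesizer would model. This is a minimal NumPy sketch under simplifying assumptions: the frame extraction, linear-phase handling, and interpolation of pitch-synchronous spectra across frames are simplified relative to the method reported in the paper, and the function names (`complex_cepstrum`, `split_causal_noncausal`) are illustrative only.

```python
import numpy as np


def complex_cepstrum(frame, n_fft=1024):
    """Complex cepstrum of one windowed, pitch-synchronous frame.

    It is the inverse DFT of the complex log spectrum,
    log|X(k)| + j * unwrapped phase, so it retains phase information
    that the usual real (minimum-phase) cepstrum discards.
    """
    spectrum = np.fft.fft(frame, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # guard against log(0)
    phase = np.unwrap(np.angle(spectrum))

    # Crude linear-phase removal: subtract a linear trend so the unwrapped
    # phase returns to zero at the last bin. Practical implementations
    # handle the linear phase term more carefully.
    k = np.arange(n_fft)
    phase = phase - phase[-1] * k / (n_fft - 1)

    return np.real(np.fft.ifft(log_mag + 1j * phase))


def split_causal_noncausal(ccep):
    """Split a complex cepstrum into causal (positive-quefrency) and
    noncausal (negative-quefrency) parts. In the mixed-phase view, the
    noncausal part relates to the maximum-phase (glottal) contribution."""
    n = len(ccep)
    causal = ccep[1 : n // 2]              # quefrencies +1 ... +(N/2 - 1)
    noncausal = ccep[n // 2 + 1 :][::-1]   # quefrencies -1, -2, ... -(N/2 - 1)
    return causal, noncausal


if __name__ == "__main__":
    # Toy frame: a two-period, windowed, decaying sinusoid standing in for
    # a pitch-synchronous speech segment.
    fs, f0 = 16000, 100
    t = np.arange(2 * fs // f0) / fs
    frame = np.hanning(len(t)) * np.sin(2 * np.pi * f0 * t) * np.exp(-20 * t)

    ccep = complex_cepstrum(frame)
    causal, noncausal = split_causal_noncausal(ccep)
    print("first noncausal (phase-feature) coefficients:", noncausal[:5])
```

In a full system, the noncausal coefficients extracted this way would be stacked with the spectral envelope features, statistically modeled, and then used at synthesis time to construct the glottal filter, as the abstract describes.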
