Complex cepstrum as phase information in statistical parametric speech synthesis

Statistical parametric synthesizers usually rely on a simplified model of speech production in which a minimum-phase filter is driven by a zero-phase or random-phase excitation signal. This simplification, however, ignores the natural mixed-phase characteristics of the speech signal. This paper addresses the issue by proposing the use of the complex cepstrum to model phase information in statistical parametric speech synthesizers. A frame-based complex cepstrum is calculated by interpolating pitch-synchronous magnitude and unwrapped phase spectra. The noncausal part of this frame-based complex cepstrum is then modeled as phase features in the statistical parametric synthesizer. At synthesis time, the generated phase parameters are used to derive the coefficients of a glottal filter. Experimental results show that the proposed approach effectively embeds phase information in the synthetic speech, resulting in close-to-natural waveforms and improved speech quality.
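
As a rough illustration of the analysis front end described above, the sketch below computes a complex cepstrum for a single windowed, pitch-synchronous frame and separates its causal and noncausal parts; the noncausal coefficients are the kind of phase features the synthesizer would model. This is a minimal NumPy sketch under simplifying assumptions: the frame extraction, linear-phase handling, and interpolation of pitch-synchronous spectra across frames are simplified relative to the method reported in the paper, and the function names (`complex_cepstrum`, `split_causal_noncausal`) are illustrative only.

```python
import numpy as np


def complex_cepstrum(frame, n_fft=1024):
    """Complex cepstrum of one windowed, pitch-synchronous frame.

    It is the inverse DFT of the complex log spectrum,
    log|X(k)| + j * unwrapped phase, so it retains phase information
    that the usual real (minimum-phase) cepstrum discards.
    """
    spectrum = np.fft.fft(frame, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # guard against log(0)
    phase = np.unwrap(np.angle(spectrum))

    # Crude linear-phase removal: subtract a linear trend so the unwrapped
    # phase returns to zero at the last bin. Practical implementations
    # handle the linear phase term more carefully.
    k = np.arange(n_fft)
    phase = phase - phase[-1] * k / (n_fft - 1)

    return np.real(np.fft.ifft(log_mag + 1j * phase))


def split_causal_noncausal(ccep):
    """Split a complex cepstrum into causal (positive-quefrency) and
    noncausal (negative-quefrency) parts. In the mixed-phase view, the
    noncausal part relates to the maximum-phase (glottal) contribution."""
    n = len(ccep)
    causal = ccep[1 : n // 2]              # quefrencies +1 ... +(N/2 - 1)
    noncausal = ccep[n // 2 + 1 :][::-1]   # quefrencies -1, -2, ... -(N/2 - 1)
    return causal, noncausal


if __name__ == "__main__":
    # Toy frame: a two-period, windowed, decaying sinusoid standing in for
    # a pitch-synchronous speech segment.
    fs, f0 = 16000, 100
    t = np.arange(2 * fs // f0) / fs
    frame = np.hanning(len(t)) * np.sin(2 * np.pi * f0 * t) * np.exp(-20 * t)

    ccep = complex_cepstrum(frame)
    causal, noncausal = split_causal_noncausal(ccep)
    print("first noncausal (phase-feature) coefficients:", noncausal[:5])
```

In a full system, the noncausal coefficients extracted this way would be stacked with the spectral envelope features, statistically modeled, and then used at synthesis time to construct the glottal filter, as the abstract describes.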
