Complex cepstrum for statistical parametric speech synthesis

Highlights
- The complex cepstrum is applied to statistical parametric speech synthesis.
- At synthesis time, phase features derived from the allpass component of the complex cepstrum are used to implement a glottal pulse filter.
- Experimental results show that the addition of the phase features results in better synthetic speech quality.

Statistical parametric synthesizers have typically relied on a simplified model of speech production. In this model, speech is generated using a minimum-phase filter, implemented from coefficients derived from spectral parameters, driven by a zero- or random-phase excitation signal. This excitation signal is usually constructed from fundamental frequencies and parameters that control the balance between the periodicity and aperiodicity of the signal. The application of this approach to statistical parametric synthesis has partly been motivated by speech coding theory. However, in contrast to most real-time speech coders, parametric speech synthesizers do not require causality. This allows the standard simplified model to be extended to represent the natural mixed-phase characteristics of speech signals. This paper proposes the use of the complex cepstrum to model the mixed-phase characteristics of speech through the incorporation of phase information in statistical parametric synthesis. The phase information is contained in the anti-causal portion of the complex cepstrum. These parameters have a direct connection with the shape of the glottal pulse of the excitation signal. Phase parameters are extracted frame by frame and are modeled in the same fashion as the minimum-phase synthesis filter parameters. At synthesis time, phase parameter trajectories are generated and used to modify the excitation signal. Experimental results show that the use of such complex cepstrum-based phase features results in better synthesized speech quality. Listening test results yield an average preference of 60% for the system with the proposed phase feature on both female and male voices.
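
To make the role of the anti-causal cepstrum concrete, the following is a minimal NumPy sketch, not the authors' implementation. It computes the complex cepstrum of one frame by phase unwrapping with linear-phase removal, then splits it into a minimum-phase part and an allpass part using the standard identity that the allpass cepstrum is odd and equals the original cepstrum at negative quefrencies. The function names `complex_cepstrum` and `split_min_allpass`, the FFT size, and the toy three-sample frame are all assumptions made for illustration; the abstract does not specify windowing, pitch synchronisation, or the exact decomposition used.

```python
import numpy as np


def complex_cepstrum(x, nfft=1024):
    """Complex cepstrum of a real frame x (hypothetical helper, nfft even).

    Index n >= 0 holds quefrency n; index nfft - n holds quefrency -n.
    Assumes no spectral zeros on the unit circle and a positive DC value.
    """
    spectrum = np.fft.fft(x, nfft)
    log_mag = np.log(np.abs(spectrum))
    phase = np.unwrap(np.angle(spectrum))
    half = nfft // 2
    # Remove the linear-phase term (an integer sample delay) so that the
    # remaining phase is periodic and the inverse FFT is well defined.
    delay = int(np.round(phase[half] / np.pi))
    phase = phase - np.pi * delay * np.arange(nfft) / half
    return np.fft.ifft(log_mag + 1j * phase).real


def split_min_allpass(ceps):
    """Split a complex cepstrum into minimum-phase and allpass cepstra.

    Assumed decomposition: c_min is causal and carries the log magnitude,
    c_ap is odd, so exp(FFT(c_ap)) has unit magnitude at every frequency.
    """
    n = len(ceps)
    half = n // 2
    pos = np.arange(1, half)   # quefrencies 1 .. half-1
    neg = n - pos              # matching quefrencies -1 .. -(half-1)
    c_min = np.zeros(n)
    c_ap = np.zeros(n)
    c_min[0] = ceps[0]
    c_min[half] = ceps[half]   # Nyquist quefrency kept in the min-phase part
    c_min[pos] = ceps[pos] + ceps[neg]
    c_ap[neg] = ceps[neg]      # anti-causal part: the phase information
    c_ap[pos] = -ceps[neg]
    return c_min, c_ap


# Toy mixed-phase frame: one zero inside the unit circle (a) and one
# outside (1/b). Values are illustrative only.
a, b = 0.5, 0.4
x = np.convolve([1.0, -a], [-b, 1.0])

ceps = complex_cepstrum(x)
c_min, c_ap = split_min_allpass(ceps)

# Phase response of the allpass part, i.e. what a glottal pulse filter
# built from these features would apply to the excitation.
allpass_phase = np.fft.fft(c_ap).imag
```

In this sketch only the negative-quefrency values of `c_ap` need to be stored as phase features, since the positive-quefrency values follow from the odd symmetry. A practical system would truncate them to a low order before statistical modeling; the order is not given in the abstract.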
