KLATTSTAT: knowledge-based parametric speech synthesis

This paper is an initial investigation into using knowledge-based parameters in the field of statistical parametric speech synthesis (SPSS). Utilizing the types of speech parameters used in the Klatt Formant Synthesizer we present automatic techniques for deriving such parameters from a speech database and building a statistical parametric speech synthesizer from these derived parameters. Although the work is exploratory, it shows promise in using more speech production inspired parameterizations for statistical speech synthesis.

[1]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[2]  Thierry Dutoit,et al.  Eigenresiduals for improved parametric speech synthesis , 2009, 2009 17th European Signal Processing Conference.

[3]  Thierry Dutoit,et al.  A deterministic plus stochastic model of the residual signal for improved parametric speech synthesis , 2019, INTERSPEECH.

[4]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[5]  Dennis H. Klatt,et al.  Software for a cascade/parallel formant synthesizer , 1980 .

[6]  David B. Pisoni,et al.  Text-to-speech: the mitalk system , 1987 .

[7]  Florian Metze Discriminative speaker adaptation using articulatory features , 2007, Speech Commun..

[8]  Alan W. Black,et al.  CLUSTERGEN: a statistical parametric synthesizer using trajectory modeling , 2006, INTERSPEECH.

[9]  Alan W. Black,et al.  Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[10]  John Coleman,et al.  Acoustics of American English speech : a dynamic approach , 1993 .

[11]  Keiichi Tokuda,et al.  Mixed excitation for HMM-based speech synthesis , 2001, INTERSPEECH.

[12]  Keiichi Tokuda,et al.  Minimum generation error training by using original spectrum as reference for log spectral distortion measure , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[13]  Heiga Zen,et al.  An excitation model for HMM-based speech synthesis based on residual modeling , 2007, SSW.