Measuring the perceptual effects of modelling assumptions in speech synthesis using stimuli constructed from repeated natural speech

Acoustic models used for statistical parametric speech synthesis typically incorporate many modelling assumptions. It is an open question to what extent these assumptions limit the naturalness of synthesised speech. To investigate this question, we recorded a speech corpus where each prompt was read aloud multiple times. By combining speech parameter trajectories extracted from different repetitions, we were able to quantify the perceptual effects of certain commonly used modelling assumptions. Subjective listening tests show that taking the source and filter parameters to be conditionally independent, or using diagonal covariance matrices, significantly limits the naturalness that can be achieved. Our experimental results also demonstrate the shortcomings of mean-based parameter generation.

Index terms: speech synthesis, acoustic modelling, stream independence, diagonal covariance matrices, repeated speech
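The idea of combining parameter trajectories from different repetitions can be illustrated with a minimal sketch. It is not the authors' procedure: the frame layout (a source value such as log-F0 paired with a list of filter coefficients), the function names, and the requirement that repetitions are already time-aligned to equal length are all assumptions made for illustration. Each function mimics one assumption named in the abstract: independent source and filter streams, independence across filter dimensions (as a diagonal covariance matrix implies), and mean-based generation.

```python
import random

# A repetition is a list of frames; a frame is (source_value, filter_coeffs).
# This layout is hypothetical and chosen only to keep the sketch short.

def check_aligned(reps):
    """Require equal-length repetitions (time alignment is assumed done)."""
    n_frames = len(reps[0])
    if any(len(rep) != n_frames for rep in reps):
        raise ValueError("repetitions must be time-aligned to equal length")

def independent_streams(rep_a, rep_b):
    """Source stream from rep_a, filter stream from rep_b."""
    check_aligned([rep_a, rep_b])
    return [(src, list(filt)) for (src, _), (_, filt) in zip(rep_a, rep_b)]

def independent_dimensions(reps, rng):
    """Draw each filter dimension's whole trajectory from a randomly
    chosen repetition; the source stream is kept from the first one."""
    check_aligned(reps)
    n_dims = len(reps[0][0][1])
    donors = [rng.randrange(len(reps)) for _ in range(n_dims)]
    return [
        (reps[0][t][0], [reps[donors[d]][t][1][d] for d in range(n_dims)])
        for t in range(len(reps[0]))
    ]

def mean_trajectory(reps):
    """Frame-wise average over repetitions (mean-based generation)."""
    check_aligned(reps)
    k = len(reps)
    n_dims = len(reps[0][0][1])
    return [
        (
            sum(rep[t][0] for rep in reps) / k,
            [sum(rep[t][1][d] for rep in reps) / k for d in range(n_dims)],
        )
        for t in range(len(reps[0]))
    ]
```

In a listening test of the kind the abstract describes, stimuli built this way would be vocoded and compared against speech resynthesised from a single repetition, so that any loss of naturalness can be attributed to the assumption being mimicked.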
