Attributing modelling errors in HMM synthesis by stepping gradually from natural to modelled speech

Even the best statistical parametric speech synthesis systems do not achieve the naturalness of good unit selection. We investigated possible causes of this gap by constructing speech signals that lie between natural speech and the output of a complete HMM synthesis system, which allowed us to examine individual effects of modelling. We manipulated the temporal smoothness and the variance of the spectral parameters to create stimuli, then presented these to listeners alongside natural and vocoded speech, as well as output from a full HMM-based text-to-speech system and from an idealised 'pseudo-HMM'. All speech signals except the natural waveform were created using vocoders employing one of two popular spectral parameterisations: Mel-Cepstra or Mel-Line Spectral Pairs. Listeners made 'same or different' pairwise judgements, from which we generated a perceptual map using Multidimensional Scaling. We draw conclusions about which aspects of HMM synthesis limit the naturalness of synthetic speech.
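
As an illustration of how 'same or different' judgements can be turned into a perceptual map, the following is a minimal sketch only. It assumes that the dissimilarity between two stimuli is the proportion of 'different' responses for that pair, and it uses classical (Torgerson) Multidimensional Scaling; the abstract does not state which dissimilarity measure or MDS variant was used. The function names, stimulus labels and toy judgements are hypothetical.

```python
import numpy as np


def dissimilarity_from_judgements(n_stimuli, judgements):
    """Build a symmetric dissimilarity matrix from pairwise judgements.

    judgements: iterable of (i, j, is_different) tuples.
    Assumption: dissimilarity = proportion of 'different' responses.
    Pairs that were never judged are left at 0.
    """
    diff = np.zeros((n_stimuli, n_stimuli))
    count = np.zeros((n_stimuli, n_stimuli))
    for i, j, is_different in judgements:
        # Record each judgement in both directions to keep the matrix symmetric.
        for a, b in ((i, j), (j, i)):
            diff[a, b] += float(is_different)
            count[a, b] += 1.0
    d = np.divide(diff, count, out=np.zeros_like(diff), where=count > 0)
    np.fill_diagonal(d, 0.0)
    return d


def classical_mds(d, n_dims=2):
    """Classical (Torgerson) MDS: embed points so that Euclidean
    distances approximate the dissimilarities in d."""
    n = d.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared dissimilarities.
    b = -0.5 * centring @ (d ** 2) @ centring
    eigvals, eigvecs = np.linalg.eigh(b)
    # Keep the n_dims largest eigenvalues; clip negatives to zero.
    order = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Hypothetical stimulus conditions and toy judgements (not data from the paper).
labels = ["natural", "vocoded", "pseudo-HMM", "HMM"]
toy_judgements = [
    (0, 1, False), (0, 1, True),
    (0, 2, True), (0, 3, True),
    (1, 2, True), (1, 3, True),
    (2, 3, False), (2, 3, True),
]
d = dissimilarity_from_judgements(len(labels), toy_judgements)
coords = classical_mds(d, n_dims=2)
for name, (x, y) in zip(labels, coords):
    print(f"{name:12s} {x:+.3f} {y:+.3f}")
```

Stimuli that listeners rarely distinguish end up close together in the resulting map. The map is determined only up to rotation and reflection, so the axes have no intrinsic meaning and need to be interpreted from the positions of the known conditions.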
