Investigating source and filter contributions, and their interaction, to statistical parametric speech synthesis

This paper presents an investigation of the separate perceptual degradations introduced by the modelling of source and filter features in statistical parametric speech synthesis. This is achieved using stimuli in which various permutations of natural, vocoded and modelled source and filter are combined, optionally with the addition of filter modifications (e.g. global variance or modulation spectrum scaling). We also examine the assumption of independence between source and filter parameters. Two complementary perceptual testing paradigms are adopted. In the first, we ask listeners to perform “same or different quality” judgements between pairs of stimuli from different configurations. In the second, we ask listeners to give an opinion score for individual stimuli. Combining the findings from these tests, we draw some conclusions regarding the relative contributions of source and filter to the currently rather limited naturalness of statistical parametric synthetic speech, and test whether current independence assumptions are justified.
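As a rough illustration of the stimulus design described above, the sketch below enumerates the full grid of source and filter conditions and the stimulus pairs that a "same or different quality" test could draw from. The names `LEVELS`, `stimulus_configurations` and `comparison_pairs` are hypothetical, and the full 3 x 3 grid is an assumption made for illustration: the abstract says only that "various permutations" were combined, so the configurations actually tested may be a subset, and the optional filter modifications are not represented here.

```python
from itertools import combinations, product

# Assumed condition labels, taken from the wording of the abstract.
LEVELS = ("natural", "vocoded", "modelled")


def stimulus_configurations(levels=LEVELS):
    """Return every (source, filter) combination of the given levels."""
    return [{"source": s, "filter": f} for s, f in product(levels, repeat=2)]


def comparison_pairs(configs):
    """Return all unordered pairs of distinct configurations."""
    return list(combinations(configs, 2))


configs = stimulus_configurations()
pairs = comparison_pairs(configs)
print(len(configs), "configurations,", len(pairs), "candidate pairs")
```

Enumerating the grid this way makes the growth in listening-test load visible: the number of candidate pairs grows quadratically with the number of configurations, which is one practical reason to restrict a pairwise test to a chosen subset.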
