A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis

The Continuous Wavelet Transform (CWT) has recently been proposed to model f0 in the context of speech synthesis. It was shown that systems using signal decomposition with the CWT tend to outperform systems that model the signal directly. The f0 signal is typically decomposed into various scales of differing frequency. In these experiments, we reconstruct f0 with selected frequencies and ask native listeners to judge the naturalness of synthesized utterances with respect to natural speech. Results indicate that HMM-generated f0 is comparable to the CWT low frequencies, suggesting it mostly generates utterances with neutral intonation. Middle frequencies achieve very high levels of naturalness, while very high frequencies are mostly noise.
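
As an illustration of what decomposing f0 into scales and reconstructing it from selected frequencies can look like, the following is a minimal sketch and not the procedure used in these experiments. It assumes a continuous (already interpolated) contour sampled at a fixed frame rate, a Mexican hat mother wavelet, dyadic scales measured in frames, and a plain unweighted sum of the kept scales. The names `mexican_hat`, `cwt_components` and `reconstruct` are hypothetical helpers, and the toy contour is synthetic.

```python
import numpy as np


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet, unnormalised."""
    return (1.0 - t**2) * np.exp(-0.5 * t**2)


def cwt_components(contour, scales):
    """Return an array with one row per scale (scales are in frames)."""
    contour = np.asarray(contour, dtype=float)
    n = contour.size
    rows = []
    for s in scales:
        half = int(np.ceil(5 * s))           # truncate the kernel at +/- 5 scale units
        t = np.arange(-half, half + 1) / s
        kernel = mexican_hat(t) / np.sqrt(s)
        full = np.convolve(contour, kernel)  # implicit zero padding at both edges
        rows.append(full[half:half + n])     # keep the part aligned with the input
    return np.vstack(rows)


def reconstruct(components, keep):
    """Sum only the selected scale rows (assumption: unweighted sum)."""
    idx = np.asarray(list(keep), dtype=int)
    return components[idx].sum(axis=0)


# Synthetic stand-in for an f0 contour: a slow trend plus a fast ripple.
frames = np.arange(400)
contour = np.sin(2 * np.pi * frames / 200) + 0.3 * np.sin(2 * np.pi * frames / 10)

scales = [1, 2, 4, 8, 16, 32]
components = cwt_components(contour, scales)

low_band = reconstruct(components, keep=[4, 5])   # coarsest scales: slow movement
high_band = reconstruct(components, keep=[0, 1])  # finest scales: fast movement
```

Because the sum is unweighted, `reconstruct` over all rows is not an exact inverse of the transform. The sketch is only meant to show how keeping a subset of scales isolates slow or fast movements of the contour.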
