Lombard modified text-to-speech synthesis for improved intelligibility: submission for the hurricane challenge 2013

This paper describes the modification of a TTS system to improve the intelligibility of speech in various noise conditions. First, the GlottHMM vocoder is used to train a voice with modal speech data. The vocoder and voice parameters are then modified to mimic the properties of the Lombard effect, based on a small amount of Lombard speech from the same speaker. More specifically, durations are increased, the fundamental frequency is raised, spectral tilt is decreased, the harmonic-to-noise ratio is increased, and pressed glottal flow pulses are used in creating the excitation. The formants of the speech are also enhanced, and finally the speech is compressed to increase the noise robustness of the voice. The evaluation results of the Hurricane Challenge 2013 indicate that the modified voice is mostly less intelligible than the unmodified natural speech, as expected, but more intelligible than the reference TTS voice, especially in the low-SNR conditions.

Index Terms: Hurricane Challenge, speech synthesis, GlottHMM, Lombard speech, intelligibility
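As a rough illustration of the kind of parameter modifications listed in the abstract, the following minimal Python sketch scales durations, raises the fundamental frequency, boosts the harmonic-to-noise ratio, flattens spectral tilt, and compresses the waveform. It is not the GlottHMM implementation: the function names, the frame dictionary layout, and all scaling factors are hypothetical, the first-order pre-emphasis filter is a simple stand-in for the tilt modification, and the power-law compressor is a stand-in for whatever compression the system actually applies.

```python
import math

def lombard_modify(frames, dur_scale=1.2, f0_scale=1.15, hnr_boost_db=3.0):
    """Apply Lombard-style changes to per-frame parameters.

    frames: list of dicts with keys 'dur_ms', 'f0_hz', 'hnr_db'
    (a hypothetical layout; f0_hz == 0 marks an unvoiced frame).
    The default factors are illustrative, not values from the paper.
    """
    out = []
    for fr in frames:
        voiced = fr["f0_hz"] > 0
        out.append({
            "dur_ms": fr["dur_ms"] * dur_scale,                        # longer durations
            "f0_hz": fr["f0_hz"] * f0_scale if voiced else 0.0,        # raised F0
            "hnr_db": fr["hnr_db"] + hnr_boost_db if voiced else fr["hnr_db"],
        })
    return out

def flatten_tilt(signal, alpha=0.3):
    """First-order pre-emphasis, y[n] = x[n] - alpha * x[n-1].

    Boosts high frequencies relative to low ones, which reduces
    spectral tilt. Used here only as a simple stand-in.
    """
    out = []
    prev = 0.0
    for x in signal:
        out.append(x - alpha * prev)
        prev = x
    return out

def compress(signal, exponent=0.5):
    """Static power-law compression on a peak-normalized signal.

    Quiet samples are raised relative to loud ones; the sign of each
    sample is preserved.
    """
    peak = max((abs(x) for x in signal), default=0.0) or 1.0
    return [math.copysign((abs(x) / peak) ** exponent, x) for x in signal]
```

In this sketch the parameter-level changes (duration, F0, HNR) are applied before synthesis, while tilt flattening and compression operate on the synthesized waveform. A real system would adapt the scaling factors from the Lombard speech data rather than fix them by hand.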
