Intelligibility enhancement of HMM-generated speech in additive noise by modifying Mel cepstral coefficients to increase the glimpse proportion

This paper describes speech intelligibility enhancement for Hidden Markov Model (HMM) generated synthetic speech in noise. We present a method for modifying the Mel cepstral coefficients generated by statistical parametric models that have been trained on plain speech. We update these coefficients so that the glimpse proportion (an objective measure of the intelligibility of speech in noise) increases, while keeping the speech energy fixed. An acoustic analysis reveals that the modified speech is boosted in the 1-4 kHz region, particularly for vowels, nasals and approximants. Results from listening tests employing speech-shaped noise show that the modified speech is as intelligible as a synthetic voice trained on plain speech whose duration, Mel cepstral coefficients and excitation signal parameters have been adapted to Lombard speech from the same speaker. Unlike that adaptation, our proposed method requires no additional recordings of Lombard speech. In the presence of a competing talker, both modification and adaptation of spectral coefficients give more modest gains.
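To make the two quantities in the abstract concrete, the following is a minimal sketch of a glimpse-proportion style measure and an energy-preserving spectral boost. It is not the authors' implementation. It assumes speech and noise are already available as time-frequency arrays of shape (frequency bands, frames), it counts a cell as a glimpse when speech exceeds noise by more than 3 dB (a commonly used threshold, taken here as an assumption), and it replaces the paper's optimisation of Mel cepstral coefficients with a fixed gain in the 1-4 kHz band. The function names `glimpse_proportion`, `rescale_to_energy` and `boost_band` are hypothetical.

```python
import numpy as np


def glimpse_proportion(speech_db, noise_db, threshold_db=3.0):
    """Fraction of time-frequency cells where speech exceeds noise
    by more than threshold_db (assumed threshold, in dB)."""
    speech_db = np.asarray(speech_db, dtype=float)
    noise_db = np.asarray(noise_db, dtype=float)
    return float(np.mean(speech_db > noise_db + threshold_db))


def rescale_to_energy(power, target_energy):
    """Scale a power array so that its total energy equals target_energy."""
    power = np.asarray(power, dtype=float)
    return power * (target_energy / power.sum())


def boost_band(power, freqs_hz, lo_hz=1000.0, hi_hz=4000.0, gain_db=6.0):
    """Apply a fixed gain to bands in [lo_hz, hi_hz], then renormalise so
    the total energy is unchanged. A stand-in for the paper's
    coefficient update, not a reproduction of it."""
    power = np.asarray(power, dtype=float)
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    in_band = (freqs_hz >= lo_hz) & (freqs_hz <= hi_hz)
    gain = np.where(in_band, 10.0 ** (gain_db / 10.0), 1.0)
    boosted = power * gain[:, None]  # broadcast per-band gain over frames
    return rescale_to_energy(boosted, power.sum())
```

Because the total energy is held fixed, any gain in the 1-4 kHz band is paid for by attenuation elsewhere; the measure rewards that trade only when it lifts more cells above the noise than it pushes below.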
