Effects of Lombard Reflex on the Performance of Deep-learning-based Audio-visual Speech Enhancement Systems

Humans tend to change their way of speaking when they are immersed in a noisy environment, a reflex known as the Lombard effect. Current deep-learning-based speech enhancement systems do not usually take this change in speaking style into account, because they are trained on neutral (non-Lombard) speech utterances recorded under quiet conditions to which noise is artificially added. In this paper, we investigate the effects that the Lombard reflex has on the performance of deep-learning-based audio-visual speech enhancement systems. The results show a performance gap of up to approximately 5 dB between systems trained on neutral speech and those trained on Lombard speech. This indicates the benefit of taking the mismatch between neutral and Lombard speech into account when designing audio-visual speech enhancement systems.
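For concreteness, the sketch below illustrates the common data-generation practice the abstract refers to: mixing a clean (neutral) speech utterance with a noise recording at a chosen signal-to-noise ratio to create noisy training material. This is a minimal, generic example, not the authors' pipeline; all function and variable names (e.g. mix_at_snr) are hypothetical.

# Minimal sketch of mixing clean speech with noise at a target SNR.
# Not the paper's code; names and parameters are illustrative assumptions.
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `clean` so the mixture has the target SNR in dB."""
    # Tile or trim the noise so it matches the length of the clean utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    # Scale the noise so that 10*log10(P_clean / P_noise_scaled) == snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

# Example: simulate one noisy training utterance at 0 dB SNR.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stand-in for a 1 s clean utterance at 16 kHz
noise = rng.standard_normal(16000)   # stand-in for a noise recording
noisy = mix_at_snr(clean, noise, snr_db=0.0)

Because this procedure uses quiet-condition (neutral) recordings, the resulting training data do not exhibit the Lombard-style articulation that speakers produce in real noise, which is precisely the mismatch the paper investigates.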
