Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect

When speaking in the presence of background noise, humans reflexively change their way of speaking to improve the intelligibility of their speech. This reflex is known as the Lombard effect. Collecting speech in Lombard conditions is usually hard and costly. For this reason, speech enhancement systems are generally trained and evaluated on speech recorded in quiet, to which noise is artificially added. Since these systems are often used in situations where Lombard speech occurs, in this work we analyse the impact of the Lombard effect on audio, visual and audio-visual speech enhancement, focusing on deep-learning-based systems, since they represent the current state of the art in the field. We conduct several experiments using an audio-visual Lombard speech corpus consisting of utterances spoken by 54 different talkers. The results show that training deep-learning-based models with Lombard speech is beneficial in terms of both estimated speech quality and estimated speech intelligibility at low signal-to-noise ratios, where the visual modality can play an important role in acoustically challenging situations. We also find that a performance difference between genders exists due to the distinct Lombard speech exhibited by males and females, and we analyse it in relation to acoustic and visual features. Furthermore, listening tests conducted with audio-visual stimuli show that, at a signal-to-noise ratio of -5 dB, the speech quality of signals processed with systems trained using Lombard speech is statistically significantly better than that obtained using systems trained with non-Lombard speech. Regarding speech intelligibility, we find a general tendency toward a benefit from training the systems with Lombard speech.
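As an illustration of the conventional setup described above, in which noise is artificially added to speech recorded in quiet, the following minimal sketch mixes two signals at a target signal-to-noise ratio. It is not the authors' pipeline: the function name `mix_at_snr`, the synthetic sine-wave "speech", the white-noise masker and the 16 kHz sampling rate are all assumptions made for the example.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it.

    Hypothetical helper for illustration only; returns the noisy mixture
    and the scaled noise that was added.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power to the level required by the target SNR.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise

# Stand-ins for a clean utterance and a noise recording (assumed 16 kHz, 1 s).
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)

noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=-5.0)
achieved_snr_db = 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))
```

A mixture built this way reaches the requested signal-to-noise ratio, but the speech itself was still produced in quiet, so it carries none of the Lombard modifications a talker would make in real noise.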
