The impact of the Lombard effect on audio and visual speech recognition systems

Abstract: When producing speech in noisy backgrounds, talkers reflexively adapt their speaking style in ways that increase speech-in-noise intelligibility. This adaptation, known as the Lombard effect, is likely to have an adverse effect on the performance of automatic speech recognition systems that have not been designed to anticipate it. However, previous studies of this impact have used very small amounts of data and recognition systems that lack modern adaptation strategies. This paper aims to rectify this by using a new audio-visual Lombard corpus containing speech from 54 different speakers, significantly larger than any previously available, together with modern state-of-the-art speech recognition techniques. The paper is organised as three speech-in-noise recognition studies. The first examines the case in which a system trained exclusively on normal speech is presented with Lombard speech. It was found that the Lombard mismatch caused a significant decrease in performance, even when the level of the Lombard speech was normalised to match the level of normal speech. However, the size of the mismatch was highly speaker-dependent, which explains the conflicting results presented in previous, smaller studies. The second study compares systems trained in matched conditions (i.e., training and testing with the same speaking style). Here the Lombard speech affords a large increase in recognition performance. Part of this is due to the greater energy leading to a reduction in noise masking, but performance improvements persist even after the difference in signal-to-noise ratio is compensated for. An analysis across speakers shows that the Lombard speech energy is spectro-temporally distributed in a way that reduces energetic masking, and this reduction in masking is associated with an increase in recognition performance. The final study repeats the first two using a recognition system trained on visual speech.
In the visual domain, performance differences are not confounded by differences in noise masking. It was found that in matched conditions Lombard speech supports better recognition performance than normal speech. The benefit was consistently present across all speakers, but to a varying degree. Surprisingly, a small Lombard benefit was observed even when training on mismatched non-Lombard visual speech, i.e., the increased clarity of the Lombard speech outweighed the impact of the mismatch. The paper presents two generally applicable conclusions: (i) systems that are designed to operate in noise will benefit from being trained on well-matched Lombard speech data, and (ii) the results of speech recognition evaluations that employ artificial speech and noise mixing need to be treated with caution: they are overly optimistic to the extent that they ignore a significant source of mismatch, but at the same time overly pessimistic in that they do not anticipate the potential increased intelligibility of the Lombard speaking style.
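The abstract refers to two signal-level operations: normalising the level of Lombard speech to that of normal speech, and artificially mixing speech with noise. The sketch below illustrates one common way to do both, assuming RMS is used as the level measure and that the SNR is defined over the whole utterance. The function names (`rms`, `match_level`, `mix_at_snr`) and the synthetic signals are hypothetical illustrations, not the procedure used in the paper.

```python
import numpy as np

def rms(x):
    """Root-mean-square level of a signal (assumed level measure)."""
    return float(np.sqrt(np.mean(np.square(x))))

def match_level(signal, reference):
    """Scale `signal` so that its RMS equals the RMS of `reference`."""
    return signal * (rms(reference) / rms(signal))

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise RMS ratio is `snr_db`, then add.

    Returns the mixture and the gain that was applied to the noise.
    """
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    return speech + gain * noise, gain

# Synthetic stand-ins for recordings (random noise, not real speech).
rng = np.random.default_rng(0)
normal = 0.1 * rng.standard_normal(16000)   # quieter "normal" utterance
lombard = 0.3 * rng.standard_normal(16000)  # louder "Lombard" utterance
noise = rng.standard_normal(16000)

# Remove the overall level difference, then mix at a fixed global SNR.
lombard_norm = match_level(lombard, normal)
mixed, gain = mix_at_snr(lombard_norm, noise, snr_db=0.0)
```

Level matching of this kind removes only the overall energy difference between the two speaking styles. Any difference in how the energy is distributed over time and frequency remains, which is consistent with the abstract's observation that performance differences persist after level normalisation.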
