A non-intrusive method for estimating binaural speech intelligibility from noise-corrupted signals captured by a pair of microphones

A non-intrusive method is introduced to predict binaural speech intelligibility in noise directly from signals captured using a pair of microphones. The approach combines blind source separation and localisation techniques with an intrusive objective intelligibility measure (OIM). Unlike classic intrusive OIMs, the method therefore requires neither a clean reference speech signal nor prior knowledge of the source locations. It can estimate intelligibility in stationary and fluctuating noise when the masker is presented as a point or diffuse source and is spatially separated from the target speech source in the horizontal plane. The performance of the method was evaluated in two rooms. When predicting subjective intelligibility measured as word recognition rate, it showed reasonable predictive accuracy, with correlation coefficients above 0.82, comparable to that of a reference intrusive OIM in most conditions. The proposed approach offers a solution for fast binaural intelligibility prediction and therefore has practical potential for deployment in situations where on-site speech intelligibility is a concern.
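The predictive accuracy quoted above is a correlation between model predictions and listener scores. The abstract does not say which correlation coefficient was used, so the following is only a minimal sketch assuming Pearson's linear correlation. The function name `pearson_r` and all numbers are hypothetical, invented for illustration, and are not data from the study.

```python
from math import sqrt


def pearson_r(predicted, measured):
    """Pearson correlation between model predictions and listener scores."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    # Co-variation of the two sequences around their means
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_p * var_m)


# Hypothetical per-condition values (NOT results from the paper):
# model output per listening condition, and word recognition rate (0..1).
predicted_scores = [0.21, 0.35, 0.48, 0.62, 0.80]
word_recognition = [0.15, 0.40, 0.55, 0.70, 0.92]

r = pearson_r(predicted_scores, word_recognition)
print(f"r = {r:.3f}")
```

In practice, one correlation would typically be computed per room or per masker type, so that a claim such as "above 0.82" can be checked condition by condition.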
