Prediction of Subjective Listening Effort from Acoustic Data with Non-Intrusive Deep Models

The effort of listening to spoken language is a highly important perceptual measure for the design of speech enhancement algorithms and hearing-aid processing. In previous research, we proposed a model that quantifies the phoneme output probabilities obtained from a deep neural network (DNN), which resulted in accurate predictions for unseen speech samples. However, these high correlations between subjective ratings and model output were only observed for known noise types, an unrealistic assumption in real-life scenarios. This paper explores non-intrusive listening effort prediction in unseen noisy environments. A set of different noise types is used for training a standard automatic speech recognition (ASR) system. Model predictions are produced by measuring the mean temporal distances of phoneme posterior vectors from the DNN. These predictions are compared to subjective ratings from hearing-impaired and normal-hearing listeners in three databases that cover a variety of noise types and signal enhancement algorithms. We obtain an average correlation of 0.88 and outperform three baseline measures in most conditions.
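To make the prediction principle more concrete, the following is a minimal sketch of a mean-temporal-distance computation over frame-wise phoneme posterior vectors, in the spirit of Hermansky's M-measure. The choice of symmetric KL divergence as the frame-pair distance, the maximum frame offset, and the averaging scheme are illustrative assumptions, not the exact configuration used in the paper.

```python
import numpy as np

def mean_temporal_distance(posteriors, max_delta=40, eps=1e-10):
    """Mean temporal distance over a matrix of frame-wise phoneme
    posteriors (shape: frames x phonemes).

    For each offset delta = 1..max_delta, the symmetric KL divergence
    between posterior vectors delta frames apart is averaged over all
    frame pairs; the final score is the mean over all offsets.
    Note: distance measure and max_delta are assumptions for this sketch."""
    p = np.clip(posteriors, eps, 1.0)
    per_delta = []
    for delta in range(1, max_delta + 1):
        a, b = p[:-delta], p[delta:]
        sym_kl = np.sum(a * np.log(a / b) + b * np.log(b / a), axis=1)
        per_delta.append(sym_kl.mean())
    return float(np.mean(per_delta))

# Toy usage: 200 frames of posteriors over 40 phoneme classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 40))
posts = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(mean_temporal_distance(posts))
```

Intuitively, clean speech yields posteriors that change decisively between phonemes (large temporal distances), whereas noisy or degraded speech produces flatter, more uncertain posteriors and hence smaller distances, which is the property the listening-effort predictor exploits.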
