iMetricGAN: Intelligibility Enhancement for Speech-in-Noise using Generative Adversarial Network-based Metric Learning

The intelligibility of natural speech degrades severely in adverse noisy environments. In this work, we propose a deep learning-based speech modification method that compensates for this intelligibility loss under the constraint that the root mean square (RMS) level and the duration of the speech signal are the same before and after modification. Specifically, we use an iMetricGAN approach that optimizes speech intelligibility metrics with generative adversarial networks (GANs). Experimental results show that the proposed iMetricGAN outperforms conventional state-of-the-art algorithms on two objective measures, speech intelligibility in bits (SIIB) and extended short-time objective intelligibility (ESTOI), under a cafeteria noise condition. In addition, formal listening tests reveal significant intelligibility gains when both noise and reverberation are present.
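The equal-RMS, equal-duration constraint stated above can be illustrated with a minimal sketch. The helper names `rms` and `match_rms`, the synthetic input signal, and the `tanh` stand-in modification are all assumptions made for illustration; they are not the iMetricGAN model or the authors' code. The sketch only shows how a modified signal might be rescaled so that its RMS level matches the original while its length is left untouched.

```python
import numpy as np


def rms(x):
    """Root mean square level of a 1-D signal."""
    return float(np.sqrt(np.mean(np.square(x))))


def match_rms(modified, reference, eps=1e-12):
    """Rescale `modified` so that its RMS equals that of `reference`.

    Hypothetical helper: it enforces the two constraints named in the
    abstract (same duration, same RMS level) on an already modified signal.
    """
    if modified.shape != reference.shape:
        raise ValueError("duration must be unchanged: shapes differ")
    gain = rms(reference) / max(rms(modified), eps)
    return modified * gain


# Synthetic stand-in for one second of speech at 16 kHz (assumption).
rng = np.random.default_rng(0)
speech = 0.1 * rng.standard_normal(16000)

# Placeholder modification (a soft compressor), NOT the paper's generator.
modified = np.tanh(3.0 * speech)

constrained = match_rms(modified, speech)
```

After rescaling, `constrained` has the same number of samples and the same RMS level as `speech`, so any intelligibility difference cannot be attributed to a simple increase in level.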
