Towards Intelligibility-Oriented Audio-Visual Speech Enhancement

Existing deep learning (DL) based speech enhancement (SE) approaches are generally optimised to minimise the distance between clean and enhanced speech features. While these often improve speech quality, they suffer from poor generalisation and may not deliver the required speech intelligibility in real noisy situations. To address these challenges, researchers have explored intelligibility-oriented (I-O) loss functions and the integration of audio-visual (AV) information for more robust SE. In this paper, we introduce DL-based I-O SE algorithms that exploit AV information, a novel and previously unexplored research direction. Specifically, we present a fully convolutional AV SE model that uses a modified short-time objective intelligibility (STOI) metric as its training cost function. To the best of our knowledge, this is the first work to exploit the integration of AV modalities with an I-O loss function for SE. Comparative experimental results demonstrate that, on standard objective evaluation measures, our proposed I-O AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions when dealing with unseen speakers and noises.
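The core idea of an intelligibility-oriented loss, scoring the correlation of short-time spectral-band envelopes between clean and enhanced speech rather than a plain feature distance, can be sketched as follows. This is a toy NumPy proxy for illustration only: the function name `stoi_like_loss`, the uniform band grouping, and the segment length are assumptions, not the paper's exact modified STOI objective (true STOI uses one-third-octave bands, a fixed ~384 ms analysis window, and clipping, and a training version must additionally be differentiable).

```python
import numpy as np

def stft_mag(x, frame_len=256, hop=128):
    """Magnitude STFT via Hann-windowed frames; returns (frames, bins)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def stoi_like_loss(clean, enhanced, n_bands=15, seg_len=30):
    """Simplified intelligibility-oriented loss: 1 minus the mean
    correlation of short-time band envelopes of clean vs. enhanced speech.
    A toy stand-in for a modified-STOI training objective."""
    C, E = stft_mag(clean), stft_mag(enhanced)
    # Group frequency bins into coarse bands (STOI uses 1/3-octave bands).
    bands = np.array_split(np.arange(C.shape[1]), n_bands)
    corrs = []
    for b in bands:
        c_env, e_env = C[:, b].sum(axis=1), E[:, b].sum(axis=1)
        # Score correlation over short temporal segments, as STOI does.
        for s in range(0, len(c_env) - seg_len + 1, seg_len):
            c = c_env[s:s + seg_len] - c_env[s:s + seg_len].mean()
            e = e_env[s:s + seg_len] - e_env[s:s + seg_len].mean()
            denom = np.linalg.norm(c) * np.linalg.norm(e)
            if denom > 1e-12:  # skip silent/constant segments
                corrs.append(float(c @ e) / denom)
    return 1.0 - float(np.mean(corrs))
```

Minimising this quantity drives the enhanced envelopes toward the clean ones; when `enhanced` equals `clean` the loss is near zero, and it grows as additive noise corrupts the band envelopes.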
