Deep speech inpainting of time-frequency masks

Transient loud intrusions, often occurring in noisy environments, can completely overpower the speech signal and lead to an inevitable loss of information. While existing algorithms for noise suppression can yield impressive results, their efficacy remains limited at very low signal-to-noise ratios or when parts of the signal are missing. To address these limitations, we propose an end-to-end framework for speech inpainting, the context-based retrieval of missing or severely distorted parts of the time-frequency representation of speech. The framework is based on a convolutional U-Net trained via deep feature losses, obtained using speechVGG, a deep speech feature extractor pre-trained on an auxiliary word classification task. Our evaluation results demonstrate that the proposed framework can recover large portions of a missing or distorted time-frequency representation of speech, up to 400 ms in duration and 3.2 kHz in bandwidth. In particular, our approach provided a substantial increase in the STOI and PESQ objective metrics of the initially corrupted speech samples. Notably, training the framework with deep feature losses led to the best results compared with conventional approaches.
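The two ideas in the abstract, removing a time interval or frequency band from a time-frequency representation and scoring a reconstruction by comparing deep features rather than raw values, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The 10 ms hop, the 128 frequency bins spanning 0 to 8 kHz, the function names, and in particular `toy_extractor` (a fixed random two-layer network standing in for speechVGG) are all assumptions made for illustration.

```python
import numpy as np

# Assumed analysis settings (not taken from the abstract).
HOP_MS = 10.0        # time step between spectrogram frames
N_BINS = 128         # number of frequency bins
NYQUIST_HZ = 8000.0  # upper frequency edge, i.e. 16 kHz sampling


def apply_gap(spec, time_frames=None, freq_bins=None):
    """Zero out a time interval and/or a frequency band of a (freq, time) array.

    Returns the masked copy and a boolean array that is True where data
    was removed.
    """
    removed = np.zeros(spec.shape, dtype=bool)
    if time_frames is not None:
        start, stop = time_frames
        removed[:, start:stop] = True
    if freq_bins is not None:
        low, high = freq_bins
        removed[low:high, :] = True
    return np.where(removed, 0.0, spec), removed


def toy_extractor(spec, seed=0):
    """Hypothetical stand-in for a pre-trained feature extractor.

    Uses fixed random weights with ReLU activations and returns the
    activations of every layer.
    """
    rng = np.random.default_rng(seed)
    features, x = [], spec
    for width in (64, 32):
        weights = rng.standard_normal((width, x.shape[0])) / np.sqrt(x.shape[0])
        x = np.maximum(weights @ x, 0.0)
        features.append(x)
    return features


def deep_feature_loss(extract, reconstruction, target):
    """Sum over layers of the mean absolute difference between activations."""
    pairs = zip(extract(reconstruction), extract(target))
    return sum(float(np.mean(np.abs(a - b))) for a, b in pairs)


# Convert the largest gaps quoted in the abstract into frames and bins.
gap_frames = round(400.0 / HOP_MS)                # 400 ms in frames
gap_bins = round(3200.0 / (NYQUIST_HZ / N_BINS))  # 3.2 kHz in bins

# Random data in place of a real speech spectrogram (1 s at the assumed hop).
spec = np.random.default_rng(1).random((N_BINS, 100))
masked, removed = apply_gap(
    spec,
    time_frames=(30, 30 + gap_frames),
    freq_bins=(20, 20 + gap_bins),
)

loss_identical = deep_feature_loss(toy_extractor, spec, spec)
loss_masked = deep_feature_loss(toy_extractor, masked, spec)
```

In a training setup along the lines described above, `masked` would be the network input, `spec` the target, and a loss of this kind would be evaluated on the network output. Here the loss is computed on the masked input directly, only to show that it is zero for a perfect reconstruction and positive otherwise.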
