Masking and Inpainting: A Two-Stage Speech Enhancement Approach for Low SNR and Non-Stationary Noise

Low signal-to-noise ratio (SNR) and non-stationary noise currently cause severe performance degradation in most speech enhancement models. To improve speech enhancement in these scenarios, this paper proposes a two-stage approach that consists of binary masking and spectrogram inpainting. In the binary masking stage, we first obtain a binary mask by hardening a soft mask and then use it to remove the time-frequency points that are dominated by severe noise. In the spectrogram inpainting stage, we use a CNN with partial convolution to inpaint the masked spectrogram produced by the previous stage. We compared our approach with two strong baselines, Wave-U-Net and CRN, on a low-SNR dataset containing a large amount of non-stationary noise. The experimental results show that our approach outperformed both baselines and achieved state-of-the-art performance.
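The two stages can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract does not specify the hardening threshold, the soft-mask estimator, or the network architecture, so the 0.5 threshold, the random soft mask, the single averaging kernel, and the function names `harden_mask` and `partial_conv2d` are all assumptions made for illustration. The partial convolution follows the general formulation used for image inpainting with irregular holes: only valid (unmasked) points contribute to each output, the result is rescaled by the fraction of valid points in the window, and the mask is updated so that any window containing at least one valid point becomes valid.

```python
import numpy as np


def harden_mask(soft_mask, threshold=0.5):
    """Turn a soft mask in [0, 1] into a binary mask.

    The 0.5 threshold is an assumption; the abstract does not state one.
    """
    return (soft_mask >= threshold).astype(np.float64)


def partial_conv2d(x, mask, kernel, bias=0.0):
    """Single-channel partial convolution with zero padding and stride 1.

    x      : 2-D array, e.g. a magnitude spectrogram (frequency x time)
    mask   : 2-D binary array, 1 = kept point, 0 = removed point (hole)
    kernel : 2-D array with odd height and width
    Returns (output, updated_mask).
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Zero out the holes, then pad; padded positions count as holes too.
    xp = np.pad(x * mask, ((ph, ph), (pw, pw)))
    mp = np.pad(mask, ((ph, ph), (pw, pw)))
    out = np.zeros(x.shape, dtype=np.float64)
    new_mask = np.zeros(mask.shape, dtype=np.float64)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            valid = mp[i:i + kh, j:j + kw].sum()
            if valid > 0:
                window = xp[i:i + kh, j:j + kw]
                # Rescale by (window size / number of valid points).
                out[i, j] = (kernel * window).sum() * (kh * kw / valid) + bias
                new_mask[i, j] = 1.0
    return out, new_mask


# Stage 1 (binary masking): remove noise-dominated time-frequency points.
rng = np.random.default_rng(0)
noisy_mag = rng.random((6, 8))   # stand-in for a noisy magnitude spectrogram
soft_mask = rng.random((6, 8))   # stand-in for an estimated soft mask
binary_mask = harden_mask(soft_mask)
masked_mag = noisy_mag * binary_mask

# Stage 2 (inpainting): one partial-convolution step with an averaging kernel.
kernel = np.full((3, 3), 1.0 / 9.0)
filled, updated_mask = partial_conv2d(masked_mag, binary_mask, kernel)
```

In a real inpainting network, many such layers with learned kernels would be stacked, and the updated mask from each layer would be passed to the next so that the holes shrink progressively.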
