Glance and Gaze: A Collaborative Learning Framework for Single-channel Speech Enhancement

The human ability to attend to both coarse and fine-grained regions has been applied to computer vision tasks. Motivated by this, we propose a collaborative learning framework in the complex domain for monaural noise suppression. The proposed system consists of two principal modules, namely a spectral feature extraction module (FEM) and stacked glance-gaze modules (GGMs). In the FEM, a UNet-block is introduced after each convolution layer, enabling feature recalibration at multiple scales. In each GGM, we decompose the multi-target optimization of the complex spectrum into two sub-tasks: the glance path suppresses noise in the magnitude domain to obtain a coarse estimate, while the gaze path compensates for the lost spectral detail in the complex domain. The two paths work collaboratively and facilitate spectral estimation from complementary perspectives. Moreover, by repeatedly unfolding the GGMs, the intermediate result is iteratively refined across stages, leading to the final spectral estimate. Experiments are conducted on the WSJ0-SI84, DNS-Challenge, and Voicebank+Demand datasets. Results show that the proposed approach achieves state-of-the-art performance over previous advanced systems on the WSJ0-SI84 and DNS-Challenge datasets, while competitive performance is achieved on the Voicebank+Demand corpus. A code sketch of the glance-gaze decomposition follows below.
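To make the glance-gaze decomposition concrete, here is a minimal PyTorch sketch of a single GGM and the stage-wise unfolding. The module and layer choices here (`GlanceGazeModule`, the two small feed-forward paths, a sigmoid-bounded magnitude gain, an additive complex residual, and fusion by summation) are illustrative assumptions for exposition, not the paper's exact architecture.

```python
# A minimal sketch of the glance-gaze idea described in the abstract.
# All names and layer sizes are hypothetical; the paper's actual GGM
# uses a different, larger topology.
import torch
import torch.nn as nn

class GlanceGazeModule(nn.Module):
    def __init__(self, num_freq_bins: int, hidden: int = 64):
        super().__init__()
        # Glance path: predicts a bounded magnitude mask (coarse denoising).
        self.glance_net = nn.Sequential(
            nn.Linear(2 * num_freq_bins, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_freq_bins),
            nn.Sigmoid(),  # gain mask in [0, 1]
        )
        # Gaze path: predicts a complex residual (fine spectral detail).
        self.gaze_net = nn.Sequential(
            nn.Linear(2 * num_freq_bins, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * num_freq_bins),  # real + imaginary parts
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, frames, 2 * num_freq_bins), real/imag concatenated.
        real, imag = spec.chunk(2, dim=-1)
        mag = torch.sqrt(real ** 2 + imag ** 2 + 1e-8)
        phase = torch.atan2(imag, real)

        # Glance: coarse magnitude estimate, noisy phase retained.
        gain = self.glance_net(spec)
        coarse_mag = gain * mag
        coarse = torch.cat(
            [coarse_mag * torch.cos(phase), coarse_mag * torch.sin(phase)],
            dim=-1,
        )

        # Gaze: complex residual compensating detail lost by the glance path.
        residual = self.gaze_net(spec)
        return coarse + residual


# Stacking the modules unfolds the refinement across stages: each stage
# takes the previous stage's estimate as input and refines it.
stages = nn.ModuleList([GlanceGazeModule(257) for _ in range(3)])
est = torch.randn(4, 100, 2 * 257)  # dummy noisy spectrum (batch, T, 2F)
for stage in stages:
    est = stage(est)
```

The key design point the sketch illustrates is that the two paths see the same input but optimize complementary targets: the glance path only needs to get the gross spectral envelope right, while the gaze path operates on the full complex representation and is free to restore phase-dependent detail the magnitude mask cannot express.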
