Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones

In mobile speech communication, speech signals can be severely corrupted by background noise when the far-end talker is in a noisy acoustic environment. To suppress background noise, speech enhancement systems are typically integrated into mobile phones, in which one or more microphones are deployed. In this study, we propose a novel deep learning based approach to real-time speech enhancement for dual-microphone mobile phones. The proposed approach employs a new densely-connected convolutional recurrent network to perform dual-channel complex spectral mapping. We utilize a structured pruning technique to compress the model without significantly degrading the enhancement performance, which yields a low-latency and memory-efficient enhancement system for real-time processing. Experimental results suggest that the proposed approach consistently outperforms an earlier approach to dual-channel speech enhancement for mobile phone communication, as well as a deep learning based beamformer.
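As a rough illustration of what "dual-channel complex spectral mapping" could look like at the data level, the sketch below stacks the real and imaginary STFT components of both microphone signals into a four-plane input and uses the real and imaginary components of the clean speech spectrum as the two-plane training target. This is a minimal sketch, not the authors' implementation. The function names, the naive DFT, the frame length and hop size, the toy signals, and the choice of the first microphone as the reference channel are all assumptions made for illustration. The network that maps input to target is not shown.

```python
import cmath
import math


def stft(signal, frame_len=8, hop=4):
    """Naive STFT: Hann-windowed frames followed by a one-sided DFT.

    Returns a list of frames, each a list of frame_len // 2 + 1 complex bins.
    Frame length and hop are illustrative values, not taken from the paper.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(n_bins)
        ])
    return frames


def real_imag_planes(spec):
    """Split a complex spectrogram into separate real and imaginary planes."""
    real = [[c.real for c in frame] for frame in spec]
    imag = [[c.imag for c in frame] for frame in spec]
    return real, imag


def dual_channel_features(primary, secondary, **stft_args):
    """Stack real/imag planes of both microphones: [4][frames][bins]."""
    planes = []
    for mic in (primary, secondary):
        planes.extend(real_imag_planes(stft(mic, **stft_args)))
    return planes


def training_target(clean_reference, **stft_args):
    """Real/imag planes of the clean speech spectrum: [2][frames][bins]."""
    return list(real_imag_planes(stft(clean_reference, **stft_args)))


# Toy signals (hypothetical): the second microphone picks up attenuated speech.
clean = [math.sin(2 * math.pi * 3 * n / 32) for n in range(32)]
noise = [0.3 * math.cos(2 * math.pi * 7 * n / 32) for n in range(32)]
primary = [c + v for c, v in zip(clean, noise)]
secondary = [0.5 * c + v for c, v in zip(clean, noise)]

features = dual_channel_features(primary, secondary)
target = training_target(clean)
print(len(features), len(features[0]), len(features[0][0]))  # 4 7 5
print(len(target), len(target[0]), len(target[0][0]))        # 2 7 5
```

Because the target holds both real and imaginary components, a model trained on such pairs would estimate magnitude and phase jointly rather than reusing the noisy phase.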
