Real-Time Denoising and Dereverberation with Tiny Recurrent U-Net

Modern deep learning-based models have achieved outstanding performance on speech enhancement tasks. However, the number of parameters in state-of-the-art models is often too large for deployment on devices in real-world applications. To this end, we propose Tiny Recurrent U-Net (TRU-Net), a lightweight online inference model that matches the performance of current state-of-the-art models. The quantized version of TRU-Net is 362 kilobytes, small enough to be deployed on edge devices. In addition, we combine the small model with a new masking method called the phase-aware β-sigmoid mask, which enables simultaneous denoising and dereverberation. Results of both objective and subjective evaluations show that our model achieves performance competitive with current state-of-the-art models on benchmark datasets while using orders of magnitude fewer parameters.
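The abstract does not spell out the exact form of the phase-aware β-sigmoid mask, so the following is only a minimal NumPy sketch of the general idea the name suggests: a complex time-frequency mask whose magnitude is a sigmoid scaled by a factor β (letting the mask exceed 1, which recovering an attenuated direct path may require) and whose phase is an estimated rotation θ. The function name, array shapes, and the β/θ parameterization here are our assumptions, not the paper's definitions.

    import numpy as np

    def apply_phase_aware_mask(stft_mix, z, beta, theta):
        """Apply a hypothetical phase-aware beta-sigmoid mask to a mixture STFT.

        stft_mix : complex [freq, time] STFT of the noisy, reverberant input
        z        : real-valued network logits for the magnitude mask
        beta     : positive scale, so the magnitude mask lies in (0, beta)
        theta    : estimated per-bin phase correction in radians
        """
        mag_mask = beta / (1.0 + np.exp(-z))          # beta-scaled sigmoid
        complex_mask = mag_mask * np.exp(1j * theta)  # attach phase rotation
        return complex_mask * stft_mix                # enhanced STFT estimate

    # Toy usage with random data standing in for network outputs.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
    Y = apply_phase_aware_mask(X, rng.standard_normal(X.shape), beta=2.0,
                               theta=rng.uniform(-np.pi, np.pi, size=X.shape))
    print(Y.shape)  # (257, 100)

As a back-of-the-envelope check on the reported model size: assuming 8-bit integer quantization (one byte per weight) and negligible metadata overhead, 362 kilobytes corresponds to roughly 0.36 to 0.37 million parameters, consistent with a model orders of magnitude smaller than typical state-of-the-art enhancement networks. The one-byte-per-weight assumption is ours, not the paper's.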
