Using Recurrences in Time and Frequency within U-net Architecture for Speech Enhancement

When designing a fully convolutional neural network, there is a trade-off between receptive field size, the number of parameters, and the spatial resolution of features in the deeper layers of the network. In this work we present a novel network design, based on a combination of convolutional and recurrent layers, that resolves this trade-off. We compare our solution with U-net-based models known from the literature and with other baseline models on a speech enhancement task. We evaluate on TIMIT speech utterances mixed with noise segments extracted from the NOISEX-92 database and show a clear advantage of the proposed solution over the current state-of-the-art in terms of SDR (signal-to-distortion ratio), SIR (signal-to-interference ratio), and STOI (short-time objective intelligibility).
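
The abstract does not spell out the architecture, but the core idea of interleaving convolutions with recurrent sweeps along the time and frequency axes of a spectrogram can be sketched as below. This is a minimal PyTorch illustration of one ReNet-style way to do it; the module name `ConvTimeFreqRecurrentBlock`, the layer sizes, and the placement of the recurrences are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class ConvTimeFreqRecurrentBlock(nn.Module):
    """Conv2d followed by bidirectional GRU sweeps over the time and frequency axes."""

    def __init__(self, in_ch: int, out_ch: int, hidden: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # Bidirectional GRUs: hidden sizes are halved so the concatenated
        # forward/backward outputs keep the intended channel counts.
        self.time_rnn = nn.GRU(out_ch, hidden // 2, batch_first=True, bidirectional=True)
        self.freq_rnn = nn.GRU(hidden, out_ch // 2, batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        b, _, f, t = x.shape
        h = torch.relu(self.conv(x))                        # (b, out_ch, f, t)
        # Recurrent sweep along time: fold frequency into the batch dimension.
        h = h.permute(0, 2, 3, 1).reshape(b * f, t, -1)     # (b*f, t, out_ch)
        h, _ = self.time_rnn(h)                             # (b*f, t, hidden)
        # Recurrent sweep along frequency: fold time into the batch dimension.
        h = h.reshape(b, f, t, -1).permute(0, 2, 1, 3).reshape(b * t, f, -1)
        h, _ = self.freq_rnn(h)                             # (b*t, f, out_ch)
        return h.reshape(b, t, f, -1).permute(0, 3, 2, 1)   # (b, out_ch, f, t)


# Example: a batch of 2 one-channel magnitude spectrograms, 128 bins x 100 frames.
block = ConvTimeFreqRecurrentBlock(in_ch=1, out_ch=32, hidden=64)
out = block(torch.randn(2, 1, 128, 100))                    # -> (2, 32, 128, 100)
```

Because the recurrences pass information across the entire time and frequency extent of the feature map, such a block gives a global receptive field without the downsampling that a purely convolutional U-net encoder would need, which is the trade-off the abstract refers to.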

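The reported metrics can be computed with standard open-source tools. The sketch below assumes the `mir_eval` and `pystoi` packages; the signals are random placeholders standing in for a clean TIMIT utterance, a NOISEX-92 noise segment, and the network output.

```python
import numpy as np
from mir_eval.separation import bss_eval_sources
from pystoi import stoi

fs = 16000                                   # TIMIT sampling rate
rng = np.random.default_rng(0)
clean = rng.standard_normal(fs * 3)          # stand-in for a clean TIMIT utterance
noise = rng.standard_normal(fs * 3)          # stand-in for a NOISEX-92 segment
noisy = clean + noise
enhanced = noisy - 0.9 * noise               # stand-in for the enhanced output

# BSS Eval: treat speech and noise as the two reference sources, and the
# enhanced signal plus its residual as the two estimates, so SIR is defined.
refs = np.stack([clean, noise])
ests = np.stack([enhanced, noisy - enhanced])
sdr, sir, sar, _ = bss_eval_sources(refs, ests)

# STOI compares the clean and processed speech directly.
d = stoi(clean, enhanced, fs, extended=False)
print(f"speech SDR={sdr[0]:.2f} dB  SIR={sir[0]:.2f} dB  STOI={d:.3f}")
```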