Progressive Speech Enhancement with Residual Connections

This paper studies the Speech Enhancement based on Deep Neural Networks. The proposed architecture gradually follows the signal transformation during enhancement by means of a visualization probe at each network block. Alongside the process, the enhancement performance is visually inspected and evaluated in terms of regression cost. This progressive scheme is based on Residual Networks. During the process, we investigate a residual connection with a constant number of channels, including internal state between blocks, and adding progressive supervision. The insights provided by the interpretation of the network enhancement process leads us to design an improved architecture for the enhancement purpose. Following this strategy, we are able to obtain speech enhancement results beyond the state-of-the-art, achieving a favorable trade-off between dereverberation and the amount of spectral distortion.

[1]  Nikos Komodakis,et al.  Wide Residual Networks , 2016, BMVC.

[2]  Jinwon Lee,et al.  A Fully Convolutional Neural Network for Speech Enhancement , 2016, INTERSPEECH.

[3]  Tiago H. Falk,et al.  Investigating the effect of residual and highway connections in speech enhancement models , 2018 .

[4]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[5]  James R. Glass,et al.  Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Tomohiro Nakatani,et al.  Neural Network-Based Spectrum Estimation for Online WPE Dereverberation , 2017, INTERSPEECH.

[7]  Prasanta Kumar Ghosh,et al.  Speech Enhancement Using Multiple Deep Neural Networks , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Emmanuel Vincent,et al.  VoiceHome-2, an extended corpus for multichannel speech processing in real homes , 2019, Speech Commun..

[9]  Antonio Bonafonte,et al.  SEGAN: Speech Enhancement Generative Adversarial Network , 2017, INTERSPEECH.

[10]  Reinhold Haeb-Umbach,et al.  NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing , 2018, ITG Symposium on Speech Communication.

[11]  Paul Deléglise,et al.  Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks , 2014, LREC.

[12]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Jonathan G. Fiscus,et al.  DARPA TIMIT:: acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1 , 1993 .

[14]  I. McCowan,et al.  The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): specification and initial experiments , 2005, IEEE Workshop on Automatic Speech Recognition and Understanding, 2005..

[15]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[16]  Eduardo Lleida,et al.  Wide Residual Networks 1D for Automatic Text Punctuation , 2018, IberSPEECH.

[17]  Mark Hasegawa-Johnson,et al.  Speech Enhancement Using Bayesian Wavenet , 2017, INTERSPEECH.

[18]  Tomohiro Nakatani,et al.  The reverb challenge: A common evaluation framework for dereverberation and recognition of reverberant speech , 2013, 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

[19]  Biing-Hwang Juang,et al.  Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[20]  Frank Hutter,et al.  Fixing Weight Decay Regularization in Adam , 2017, ArXiv.

[21]  Tiago H. Falk,et al.  A Non-Intrusive Quality and Intelligibility Measure of Reverberant and Dereverberated Speech , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[22]  Emmanuel Vincent,et al.  A French Corpus for Distant-Microphone Speech Processing in Real Homes , 2016, INTERSPEECH.

[23]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Steve Renals,et al.  WSJCAMO: a British English speech corpus for large vocabulary continuous speech recognition , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[25]  Changchun Bao,et al.  Speech enhancement with weighted denoising auto-encoder , 2013, INTERSPEECH.

[26]  Frank Hutter,et al.  Decoupled Weight Decay Regularization , 2017, ICLR.

[27]  Ming Tu,et al.  Speech enhancement based on Deep Neural Networks with skip connections , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[28]  Philipos C. Loizou,et al.  Speech Quality Assessment , 2011, Multimedia Analysis, Processing and Communications.

[29]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .