Deep Convolutional Neural Network-Based Inverse Filtering Approach for Speech De-Reverberation

In this paper, we introduce a spectral-domain inverse filtering approach for single-channel speech dereverberation using a deep convolutional neural network (CNN). The main goal is to better handle realistic reverberant conditions in which the room impulse response (RIR) is longer than the short-time Fourier transform (STFT) analysis window. To this end, we adopt the convolutive transfer function (CTF) model for the reverberant speech signal. In the proposed framework, the CNN is trained to directly estimate the inverse filter of the CTF model. Among the possible CNN structures, we consider the U-net, a fully convolutional auto-encoder network with skip connections. Experimental results show that the proposed method provides better dereverberation performance than prevalent benchmark algorithms under various reverberation conditions.
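As a rough illustration of the CTF model and of what inverse filtering means in the STFT domain, the following minimal sketch treats the reverberant spectrogram as a band-wise convolution along the frame axis, X(k, n) = sum_l H(k, l) S(k, n - l). It is not the proposed method: the function name ctf_apply, the toy two-tap CTF, and the closed-form truncated inverse are all assumptions made for illustration, standing in for the inverse filter that the CNN is trained to estimate.

```python
import numpy as np


def ctf_apply(spec, filt):
    """Band-wise convolution along the frame axis.

    spec: complex array of shape (bands, frames), an STFT-like spectrogram.
    filt: complex array of shape (bands, taps), one short filter per band.
    Returns out[k, n] = sum_l filt[k, l] * spec[k, n - l], with frames
    before n = 0 treated as zero.
    """
    num_frames = spec.shape[1]
    out = np.zeros(spec.shape, dtype=complex)
    for tap in range(min(filt.shape[1], num_frames)):
        # Frame n receives tap `tap` applied to frame n - tap.
        out[:, tap:] += filt[:, tap:tap + 1] * spec[:, :num_frames - tap]
    return out


# Hypothetical toy data: a random "clean" spectrogram (not real speech).
rng = np.random.default_rng(0)
num_bands, num_frames = 4, 50
clean = (rng.standard_normal((num_bands, num_frames))
         + 1j * rng.standard_normal((num_bands, num_frames)))

# Assumed two-tap CTF [1, a] in every band, so each frame leaks into the next.
a = 0.5
ctf = np.tile(np.array([1.0, a], dtype=complex), (num_bands, 1))
reverberant = ctf_apply(clean, ctf)

# For this toy CTF the exact inverse has taps (-a)**l; truncate it to 30 taps.
# In the paper's framework a CNN would estimate the inverse filter instead.
num_inverse_taps = 30
inverse = np.tile((-a) ** np.arange(num_inverse_taps), (num_bands, 1)).astype(complex)
recovered = ctf_apply(reverberant, inverse)

print("max reconstruction error:", np.max(np.abs(recovered - clean)))
```

The point of the sketch is only that, under the CTF model, dereverberation reduces to a second band-wise convolution across frames, which is why an inverse filter spanning several frames can handle an RIR longer than one analysis window.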
