Monaural Speech Enhancement with Dilated Convolutions

In this study, we propose a novel dilated convolutional neural network for enhancing speech in noisy and reverberant environments. The proposed model incorporates dilated convolutions for tracking a target speaker through context aggregations, skip connections, and residual learning for mapping-based monaural speech enhancement. The performance of our model was evaluated in a variety of simulated environments having different reverberation times and quantified using two objective measures. Experimental results show that the proposed model outperforms a long short-term memory (LSTM), a gated residual network (GRN) and convolutional recurrent network (CRN) model in terms of objective speech intelligibility and speech quality in noisy and reverberant environments. Compared to LSTM, CRN and GRN, our method has improved generalization to untrained speakers and noise, and has fewer training parameters resulting in greater computational efficiency.

[1]  Yoshua Bengio,et al.  Deep Sparse Rectifier Neural Networks , 2011, AISTATS.

[2]  DeLiang Wang,et al.  Long short-term memory for speaker generalization in supervised speech separation. , 2017, The Journal of the Acoustical Society of America.

[3]  Carla Teixeira Lopes,et al.  TIMIT Acoustic-Phonetic Continuous Speech Corpus , 2012 .

[4]  Sepp Hochreiter,et al.  Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs) , 2015, ICLR.

[5]  Jesper Jensen,et al.  An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[6]  Tim Brookes,et al.  Dynamic Precedence Effect Modeling for Source Separation in Reverberant Environments , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[7]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[8]  Guy J. Brown,et al.  Computational Auditory Scene Analysis: Principles, Algorithms, and Applications , 2006 .

[9]  DeLiang Wang,et al.  Learning spectral mapping for speech dereverberation , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  DeLiang Wang,et al.  On Training Targets for Supervised Speech Separation , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[11]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[12]  Herman J. M. Steeneken,et al.  Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems , 1993, Speech Commun..

[13]  Andries P. Hekstra,et al.  Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[14]  Jont B. Allen,et al.  Image method for efficiently simulating small‐room acoustics , 1976 .

[15]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[16]  DeLiang Wang,et al.  Features for Masking-Based Monaural Speech Separation in Reverberant Conditions , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[17]  Shadi Pirhosseinloo,et al.  A new feature set for masking-based monaural speech separation , 2018, 2018 52nd Asilomar Conference on Signals, Systems, and Computers.

[18]  Kostas Kokkinakis,et al.  An Interaural Magnification Algorithm for Enhancement of Naturally-Occurring Level Differences , 2016, INTERSPEECH.

[19]  DeLiang Wang,et al.  Gated Residual Networks With Dilated Convolutions for Monaural Speech Enhancement , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[20]  Kostas Kokkinakis,et al.  Time-Frequency Masking for Blind Source Separation with Preserved Spatial Cues , 2017, INTERSPEECH.

[21]  Nobutaka Ito,et al.  The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings , 2013 .

[22]  DeLiang Wang,et al.  A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement , 2018, INTERSPEECH.

[23]  DeLiang Wang,et al.  On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis , 2005, Speech Separation by Humans and Machines.

[24]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.