Dilated convolutional recurrent neural network for monaural speech enhancement

In this study, we propose a novel dilated convolutional recurrent neural network for real-time monaural speech enhancement. The proposed model combines dilated causal convolutions with a long short-term memory (LSTM) layer and skip connections to track a target speaker in a single-channel noisy reverberant mixture. The model was evaluated in simulated rooms with different reverberation times and unseen background noises. Experimental results show that the proposed model yields significant improvements in objective speech intelligibility and speech quality over LSTM, gated residual network (GRN), and convolutional recurrent network (CRN) baselines. Moreover, it generalizes better to untrained speakers and unseen noises than LSTM, GRN, and CRN, while having fewer trainable parameters and remaining suitable for real-time applications.
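The key building block behind the model's real-time operation is the dilated causal convolution: by left-padding the input, each output frame depends only on current and past frames, and stacking layers with exponentially growing dilation rates enlarges the temporal receptive field without adding parameters. The sketch below illustrates this idea in plain NumPy; the function names and the single-channel formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Causal 1-D convolution: output[t] uses only x[t], x[t-d], x[t-2d], ...

    x: 1-D input signal, w: kernel weights (w[0] taps the current frame),
    dilation: spacing between kernel taps. Left zero-padding keeps the
    operation causal and the output the same length as the input.
    """
    k = len(w)
    pad = dilation * (k - 1)
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated causal conv layers."""
    return 1 + (kernel_size - 1) * sum(dilations)

# With kernel size 2 and dilations 1, 2, 4, 8 (doubling per layer),
# four layers already cover 16 past frames.
x = np.arange(6, dtype=float)
y = dilated_causal_conv1d(x, np.array([1.0, 1.0]), dilation=2)
print(y)                              # [0. 1. 2. 4. 6. 8.]  (y[t] = x[t] + x[t-2])
print(receptive_field(2, [1, 2, 4, 8]))  # 16
```

Because no future frames are touched, such layers can process a spectrogram frame-by-frame as audio arrives, which is what makes the architecture compatible with real-time enhancement.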
