Low-latency approximation of bidirectional recurrent networks for speech denoising

The ability to separate speech from non-stationary background noise using only a single channel of information has improved significantly with the adoption of deep learning techniques. In these approaches, a time-frequency mask that recovers clean speech from a noisy mixture is learned from data. Recurrent neural networks are particularly well suited to this sequential prediction task, with bidirectional variants (e.g., the BLSTM) achieving strong results. The downside of bidirectional models is that they must operate offline in order to perform both a forward and a backward pass over the data. In this paper, we compare two low-latency approximations of bidirectional networks: the first uses block processing with a regular bidirectional network, while the second uses the recently proposed lookahead convolution layer. Our results show that using just 1000 ms of backward context recovers approximately 75% of the performance improvement gained by moving from forward-only to bidirectional recurrent networks.
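To make the second approach concrete, the sketch below is a minimal PyTorch implementation of a lookahead (row) convolution layer in the spirit of the layer popularized by Deep Speech 2: each output frame is a per-feature weighted sum of the current frame and a fixed number of future frames, giving a unidirectional recurrent network a bounded amount of backward context. The class name, initialization, and tensor layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LookaheadConv(nn.Module):
    """Sketch of a lookahead (row) convolution: each output frame is a
    learned per-feature linear combination of the current frame and the
    next `context` frames. Assumed layout and naming, not the paper's code."""

    def __init__(self, features: int, context: int):
        super().__init__()
        # One weight per feature per future offset (0 .. context).
        self.weights = nn.Parameter(torch.randn(context + 1, features) * 0.1)
        self.context = context

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features)
        T = x.size(1)
        # Zero-pad the end of the time axis so the last frames still
        # "see" `context` future steps.
        x = F.pad(x, (0, 0, 0, self.context))
        # Stack shifted views of the sequence:
        # (batch, time, context + 1, features)
        frames = torch.stack(
            [x[:, j:j + T, :] for j in range(self.context + 1)], dim=2
        )
        # Weighted sum over the future offsets.
        return (frames * self.weights).sum(dim=2)
```

As a rough usage note: with a hypothetical 10 ms hop between spectral frames, `context=100` would correspond to approximately the 1000 ms of backward context evaluated above, while keeping the rest of the network strictly unidirectional.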
