A Speech Enhancement Neural Network Architecture with SNR-Progressive Multi-Target Learning for Robust Speech Recognition

We present a pre-processing speech enhancement network architecture for noise-robust speech recognition based on learning progressive multiple targets (PMTs). PMTs are represented by a series of progressive ratio masks (PRMs) and progressively enhanced log-power spectra (PELPS) defined at successive layers according to different signal-to-noise ratios (SNRs), aiming to trade off reduced background noise against increased speech distortion. As a PMT implementation, long short-term memory (LSTM) is adopted at each network layer to progressively learn intermediate dual targets of both PRM and PELPS. In experiments on the CHiME-4 automatic speech recognition (ASR) task, compared to unprocessed speech under multi-condition-trained LSTM-based acoustic models without retraining, using the PRM alone as the learning target achieves a relative word error rate (WER) reduction of 6.32% (from 27.68% to 25.93%) averaged over the RealData evaluation set, whereas conventional ideal ratio masks severely degrade ASR performance. Moreover, the proposed LSTM-based PMT network, with the best configuration, outperforms the PRM-only model, yielding a relative WER reduction of 13.31% (further down to 22.48%) averaged over the same test set.
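The SNR-progressive idea above can be illustrated with a minimal numpy sketch. This is a hypothetical simplification, not the paper's exact target definitions: for each layer we attenuate the noise magnitude spectrum by a fixed SNR gain (the `snr_gains_db` schedule is an assumption), form a PRM-style ratio mask relative to the original noisy power spectrum, and a PELPS-style log-power spectrum of the progressively enhanced mixture.

```python
import numpy as np

def progressive_targets(clean_mag, noise_mag, snr_gains_db=(10, 20, 30)):
    """Build per-layer (PRM, PELPS) training targets from clean and noise
    magnitude spectrograms (nonnegative arrays of equal shape).

    Hypothetical sketch: each layer's intermediate target is the mixture
    with the noise attenuated by a fixed SNR increment, so later layers
    demand progressively cleaner speech (mask -> ideal ratio mask as the
    gain grows)."""
    targets = []
    noisy_pow = clean_mag**2 + noise_mag**2   # reference noisy power spectrum
    for gain in snr_gains_db:
        scale = 10.0 ** (-gain / 20.0)        # amplitude attenuation for this layer
        n_l = noise_mag * scale               # residual noise allowed at this layer
        # PRM-style mask: power of the partially denoised target over noisy power
        prm = (clean_mag**2 + n_l**2) / noisy_pow
        # PELPS-style target: log-power spectrum of the partially enhanced mixture
        pelps = np.log((clean_mag + n_l) ** 2 + 1e-12)
        targets.append((np.clip(prm, 0.0, 1.0), pelps))
    return targets
```

Because the residual noise shrinks layer by layer, each successive mask is elementwise no larger than the previous one and approaches the conventional ideal ratio mask, which matches the progressive-learning intent described in the abstract.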
