Noise Robust IOA/CAS Speech Separation and Recognition System for the Third 'CHiME' Challenge

This paper presents a contribution to the third 'CHiME' speech separation and recognition challenge, covering both front-end signal processing and back-end speech recognition. In the front end, a multi-channel Wiener filter (MWF) is designed to reduce background noise. Unlike the traditional MWF, the parameter that controls the tradeoff between noise reduction and target signal distortion is optimized according to the desired noise reduction level. In the back end, several techniques are used to improve automatic speech recognition (ASR) performance in noise: deep neural network (DNN), convolutional neural network (CNN), and long short-term memory (LSTM) models with a medium vocabulary, lattice rescoring with a large-vocabulary language model finite state transducer, and the ROVER scheme. Experimental results show that the proposed system, combining the front end and the back end, effectively improves ASR performance.
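The tradeoff parameter mentioned above can be illustrated with the speech-distortion-weighted form of the MWF that is standard in the literature. The sketch below is not the system's actual implementation: the abstract does not say how the parameter is optimized, so `mu` is simply passed in by hand, and the function names, the reference-channel convention, and the toy covariance matrices are all assumptions made for illustration. For a single frequency bin, the filter is w = (R_xx + mu R_nn)^{-1} R_xx e_ref, where R_xx = R_yy - R_nn.

```python
import numpy as np

def sdw_mwf_weights(R_yy, R_nn, mu=1.0, ref=0):
    """Speech-distortion-weighted MWF for one frequency bin (illustrative).

    R_yy : (M, M) covariance of the noisy microphone signals
    R_nn : (M, M) covariance of the noise alone
    mu   : tradeoff weight; mu = 1 gives the classical MWF, larger mu
           favors noise reduction over low speech distortion
    ref  : index of the reference microphone (an assumption of this sketch)
    """
    R_xx = R_yy - R_nn                      # speech covariance estimate
    e_ref = np.zeros(R_yy.shape[0])
    e_ref[ref] = 1.0                        # selects the reference channel
    return np.linalg.solve(R_xx + mu * R_nn, R_xx @ e_ref)

def residual_noise_power(w, R_nn):
    """Noise power left at the filter output, w^H R_nn w."""
    return float(np.real(np.conj(w) @ R_nn @ w))

# Toy example: one speech source seen by three microphones in white noise.
a = np.array([1.0, 0.8 + 0.2j, 0.5 - 0.3j])   # hypothetical transfer vector
R_xx_true = 2.0 * np.outer(a, a.conj())       # rank-1 speech covariance
R_nn_toy = 0.5 * np.eye(3)                    # spatially white noise
R_yy_toy = R_xx_true + R_nn_toy

for mu in (0.5, 1.0, 4.0):
    w = sdw_mwf_weights(R_yy_toy, R_nn_toy, mu)
    print(mu, residual_noise_power(w, R_nn_toy))
```

In this toy setup the residual noise power falls as `mu` grows, which is the behavior a "desired noise reduction level" would be tuned against. The price is greater attenuation of the target signal.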
