Integrating Denoising Autoencoder and Vector Taylor Series with Auditory Masking for Speech Recognition in Noisy Conditions

We propose a new front-end feature compensation technique to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. First, a Time Delay Neural Network (TDNN) based Denoising Autoencoder (DAE) is used to compensate the noisy features. The DAE provides a good performance gain when it has been trained on the noise present in the test utterances (“seen” conditions). However, if the noise in the test utterance differs from the noise used to train the DAE (“unseen” conditions), performance degrades considerably. To improve ASR performance in such unseen conditions, a model compensation technique, namely Vector Taylor Series with Auditory Masking (VTS-AM), is used. We propose a new Signal-to-Noise Ratio (SNR) based measure that can reliably choose the type of compensation to use for the best performance gain. We show that the proposed technique significantly improves ASR performance on the noise-corrupted TIMIT and Librispeech databases.
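The selection step can be illustrated with a minimal sketch. The abstract does not define the SNR-based measure or the decision rule, so everything below is an assumption made for illustration only: the function names `estimate_snr_db` and `choose_compensation`, the crude energy-based SNR estimate (which treats the first few frames as noise-only), the 10 dB threshold, and the choice of which side of the threshold maps to the DAE or to VTS-AM are all hypothetical and are not the measure proposed here.

```python
import math


def estimate_snr_db(frame_energies, n_noise_frames=10):
    """Crude utterance-level SNR estimate in dB.

    Assumption (not from the paper): the first `n_noise_frames` frames
    contain noise only, so their mean energy approximates noise power.
    """
    if not frame_energies:
        raise ValueError("frame_energies must not be empty")
    noise_frames = frame_energies[:n_noise_frames]
    noise_power = max(sum(noise_frames) / len(noise_frames), 1e-10)
    total_power = sum(frame_energies) / len(frame_energies)
    # Subtract the noise estimate to approximate speech power; floor it
    # so the logarithm stays defined for very noisy input.
    speech_power = max(total_power - noise_power, 1e-10)
    return 10.0 * math.log10(speech_power / noise_power)


def choose_compensation(snr_db, threshold_db=10.0):
    """Pick a compensation method from a scalar SNR-style measure.

    Hypothetical rule: the threshold value and the direction of the
    comparison are placeholders, not the rule used in the paper.
    """
    return "DAE" if snr_db >= threshold_db else "VTS-AM"


if __name__ == "__main__":
    # Toy per-frame energies: 10 noise-only frames, then 10 speech frames.
    energies = [1.0] * 10 + [101.0] * 10
    snr = estimate_snr_db(energies)
    print(f"estimated SNR: {snr:.2f} dB -> {choose_compensation(snr)}")
```

The point of the sketch is only the structure: a single scalar measure computed per utterance gates which compensation path is applied, so the DAE and VTS-AM never need to run jointly on the same utterance.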
