Enhanced Denoising Auto-Encoder for Robust Speech Recognition in Unseen Noise Conditions

We present a robust front-end processing method for speech recognition in unseen noise conditions. To this end, we investigate the efficacy of a Time Delay Neural Network based Denoising Auto-Encoder (TDNN-DAE) in seen and unseen noise conditions. We show that while the TDNN-DAE improves speech recognition performance by a large margin in seen noise conditions (where the noise encountered during decoding was also used to train the TDNN-DAE), it fails to improve performance in unseen noise conditions (where the noise encountered during decoding was not used to train the TDNN-DAE). To address this, we propose pre-processing the training input to the TDNN-DAE with an enhancement technique, so that, in essence, the TDNN-DAE is trained to remove the residual noise left behind by that technique. For this task, we compare two enhancement techniques: Vector Taylor Series with Acoustic Masking (VTS-AM) and Spectral Subtraction (SS). We show that both techniques significantly improve the efficacy of the TDNN-DAE in unseen noise conditions, and that the VTS-AM enhanced TDNN-DAE outperforms the SS enhanced TDNN-DAE.
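
As a rough illustration of the pre-processing idea described above, the sketch below applies a basic magnitude-domain spectral subtraction to a noisy spectrogram and pairs the enhanced output with the clean target, which is the kind of input/target pair a denoising auto-encoder could then be trained on. This is not the authors' implementation: the function names, the noise estimate taken from the first few frames, and the parameters alpha and floor are all assumptions made for illustration, and the VTS-AM variant is not shown.

```python
import numpy as np


def spectral_subtraction(mag, noise_frames=10, alpha=1.0, floor=0.01):
    """Basic magnitude spectral subtraction (illustrative sketch only).

    mag          : array of shape (num_frames, num_bins), magnitude spectrogram
    noise_frames : number of leading frames assumed to contain noise only
                   (an assumption; real systems track noise adaptively)
    alpha        : over-subtraction factor (hypothetical default)
    floor        : spectral floor, as a fraction of the noisy magnitude
    """
    mag = np.asarray(mag, dtype=float)
    # Estimate the noise magnitude per frequency bin from the leading frames.
    noise_est = mag[:noise_frames].mean(axis=0)
    # Subtract the estimate, then floor to avoid negative magnitudes.
    cleaned = mag - alpha * noise_est
    return np.maximum(cleaned, floor * mag)


def make_training_pairs(noisy_mags, clean_mags, enhance=spectral_subtraction):
    """Build (enhanced input, clean target) pairs for DAE training.

    The DAE then only has to model the residual noise left by `enhance`,
    rather than the raw noise itself.
    """
    return [(enhance(noisy), np.asarray(clean, dtype=float))
            for noisy, clean in zip(noisy_mags, clean_mags)]


if __name__ == "__main__":
    # Toy data: 10 noise-only frames followed by 5 "speech" frames, 4 bins.
    clean = np.vstack([np.zeros((10, 4)), np.full((5, 4), 5.0)])
    noisy = clean + 1.0  # stationary additive noise of magnitude 1.0
    pairs = make_training_pairs([noisy], [clean])
    enhanced, target = pairs[0]
    print(enhanced.shape, target.shape)
```

With stationary noise, as in the toy data, the subtraction removes nearly all of it. With non-stationary or mismatched noise a residual remains, and that residual is what the abstract proposes the TDNN-DAE should learn to remove.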
