Improved Automatic Speech Recognition Using Subband Temporal Envelope Features and Time-Delay Neural Network Denoising Autoencoder

This paper investigates the use of perceptually motivated subband temporal envelope (STE) features and a time-delay neural network (TDNN) denoising autoencoder (DAE) to improve deep neural network (DNN)-based automatic speech recognition (ASR). STEs are estimated by full-wave rectification and low-pass filtering of the band-passed speech obtained with a Gammatone filter-bank. TDNNs are used either as DAEs or as acoustic models. ASR experiments are performed on the Aurora-4 corpus. STE features yield 2.2% and 3.7% relative word error rate (WER) reductions compared to conventional log-mel filter-bank (FBANK) features when used in ASR systems with DNN and TDNN acoustic models, respectively. Features enhanced by the TDNN DAE are recognized more accurately by the ASR system with DNN acoustic models than by the one with TDNN acoustic models. Improved ASR performance is thus obtained when features enhanced by the TDNN DAE are used with DNN acoustic models; in this scenario, STE features provide a 9.8% relative WER reduction compared to FBANK features.
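
The abstract describes the STE front end as Gammatone band-pass filtering followed by full-wave rectification and low-pass filtering of each subband. Below is a minimal sketch of such a pipeline, assuming a 4th-order gammatone filter-bank on an ERB-spaced grid; the number of channels (40), the envelope low-pass cutoff (25 Hz), and the framing parameters are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of subband temporal envelope (STE) feature extraction.
# Assumed parameters (n_channels, lp_cutoff, frame sizes) are illustrative only.
import numpy as np
from scipy.signal import butter, lfilter


def erb(fc):
    # Equivalent rectangular bandwidth (Glasberg & Moore) in Hz.
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def gammatone_ir(fc, fs, duration=0.064, order=4):
    # Impulse response of a gammatone filter centred at fc (Hz).
    t = np.arange(0, duration, 1.0 / fs)
    b = 1.019 * erb(fc)
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)


def ste_features(signal, fs, n_channels=40, fmin=100.0, fmax=None,
                 lp_cutoff=25.0, frame_shift=0.010, frame_len=0.025):
    """Return an (n_frames, n_channels) matrix of log subband envelope energies."""
    fmax = fmax or 0.45 * fs
    # ERB-rate-spaced centre frequencies between fmin and fmax.
    erb_lo = 21.4 * np.log10(4.37e-3 * fmin + 1.0)
    erb_hi = 21.4 * np.log10(4.37e-3 * fmax + 1.0)
    cfs = (10 ** (np.linspace(erb_lo, erb_hi, n_channels) / 21.4) - 1.0) / 4.37e-3

    # Low-pass filter applied to the full-wave-rectified subband signals.
    b_lp, a_lp = butter(2, lp_cutoff / (fs / 2.0), btype="low")

    shift = int(frame_shift * fs)
    length = int(frame_len * fs)
    n_frames = 1 + max(0, (len(signal) - length) // shift)
    feats = np.zeros((n_frames, n_channels))

    for ch, fc in enumerate(cfs):
        subband = np.convolve(signal, gammatone_ir(fc, fs), mode="same")
        envelope = lfilter(b_lp, a_lp, np.abs(subband))  # rectify + smooth
        for i in range(n_frames):
            frame = envelope[i * shift:i * shift + length]
            feats[i, ch] = np.log(np.mean(frame ** 2) + 1e-10)
    return feats
```

With 16 kHz input this produces one feature vector every 10 ms, analogous to a FBANK front end and thus usable as input to DNN/TDNN acoustic models or to a DAE.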
