Architectures for deep neural network based acoustic models defined over windowed speech waveforms

This paper investigates acoustic models for automatic speech recognition (ASR) based on deep neural networks (DNNs) whose input is taken directly from windowed speech waveforms (WSW). After demonstrating that these networks can automatically acquire internal representations similar to mel-scale filter banks, efficient DNN architectures for exploiting WSW features are investigated. First, a modified bottleneck DNN architecture is examined to capture dynamic spectrum information that is not well represented in the time-domain signal. Second, the redundancies inherent in WSW-based DNNs are considered. The performance of acoustic models defined over WSW features is compared to that of acoustic models defined over mel-frequency spectral coefficient (MFSC) features on the Wall Street Journal (WSJ) speech corpus. Using WSW features alone results in a 3.0 percent increase in word error rate (WER) relative to MFSC features. However, when WSW features are combined with MFSC features, a 4.1 percent reduction in WER is obtained with respect to the best MFSC-based DNN acoustic model evaluated.
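
To make the idea of a bottleneck DNN acoustic model over windowed waveform input concrete, the sketch below builds a small feed-forward network in PyTorch whose input is a single windowed waveform frame and whose hidden layers include a narrow bottleneck. The window length, layer widths, bottleneck dimension, and number of senone targets are illustrative assumptions and do not reflect the exact architecture evaluated in the paper.

# Minimal sketch (assumed topology, not the paper's exact one): a feed-forward
# DNN acoustic model that takes a windowed speech waveform (WSW) frame as input
# and passes it through a low-dimensional bottleneck layer before predicting
# tied-state (senone) targets.
import torch
import torch.nn as nn

WINDOW_SAMPLES = 400   # one 25 ms waveform window at 16 kHz (assumed)
BOTTLENECK_DIM = 40    # narrow layer forcing a compact, spectral-like code (assumed)
NUM_SENONES = 3000     # number of context-dependent HMM state targets (assumed)

wsw_dnn = nn.Sequential(
    nn.Linear(WINDOW_SAMPLES, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, BOTTLENECK_DIM), nn.ReLU(),   # bottleneck layer
    nn.Linear(BOTTLENECK_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, NUM_SENONES),                 # senone logits (softmax applied in the loss)
)

windows = torch.randn(8, WINDOW_SAMPLES)          # a batch of 8 raw waveform windows
senone_logits = wsw_dnn(windows)
print(senone_logits.shape)                        # torch.Size([8, 3000])

In a hybrid DNN-HMM setup, such a model would be trained with a cross-entropy loss against forced-alignment state labels. One way to combine feature streams (shown here only as an illustration, not necessarily the paper's scheme) is to concatenate the bottleneck activations with MFSC features and feed the result to a second-stage classifier.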
