Convolutional neural networks for acoustic modeling of raw time signal in LVCSR

In this paper we continue to investigate how the deep neural network (DNN) based acoustic models for automatic speech recognition can be trained without hand-crafted feature extraction. Previously, we have shown that a simple fully connected feedforward DNN performs surprisingly well when trained directly on the raw time signal. The analysis of the weights revealed that the DNN has learned a kind of short-time time-frequency decomposition of the speech signal. In conventional feature extraction pipelines this is done manually by means of a filter bank that is shared between the neighboring analysis windows. Following this idea, we show that the performance gap between DNNs trained on spliced hand-crafted features and DNNs trained on raw time signal can be strongly reduced by introducing 1D-convolutional layers. Thus, the DNN is forced to learn a short-time filter bank shared over a longer time span. This also allows us to interpret the weights of the second convolutional layer in the same way as 2D patches learned on critical band energies by typical convolutional neural networks. The evaluation is performed on an English LVCSR task. Trained on the raw time signal, the convolutional layers allow to reduce the WER on the test set from 25.5% to 23.4%, compared to an MFCC based result of 22.1% using fully connected layers. Index Terms: acoustic modeling, raw time signal, convolutional neural networks

[1]  Nelson Morgan,et al.  Robust CNN-based speech recognition with Gabor filter kernels , 2014, INTERSPEECH.

[2]  Dimitri Palaz,et al.  Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks , 2013, INTERSPEECH.

[3]  Lawrence D. Jackel,et al.  Handwritten Digit Recognition with a Back-Propagation Network , 1989, NIPS.

[4]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[5]  Hynek Hermansky,et al.  Multi-resolution RASTA filtering for TANDEM-based ASR , 2005, INTERSPEECH.

[6]  Alain Biem,et al.  Filter bank design based on discriminative feature extraction , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[7]  Hermann Ney,et al.  Gammatone Features and Feature Combination for Large Vocabulary Speech Recognition , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[8]  Geoffrey E. Hinton,et al.  Phoneme recognition using time-delay neural networks , 1989, IEEE Trans. Acoust. Speech Signal Process..

[9]  Dong Yu,et al.  Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[10]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[11]  Tara N. Sainath,et al.  Learning filter banks within a deep neural network framework , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[12]  Gerald Penn,et al.  Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Stan Davis,et al.  Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se , 1980 .

[14]  Hermann Ney,et al.  RASR/NN: The RWTH neural network toolkit for speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Tara N. Sainath,et al.  Improvements to Deep Convolutional Neural Networks for LVCSR , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[16]  Hermann Ney,et al.  Acoustic modeling with deep neural networks using raw time signal for LVCSR , 2014, INTERSPEECH.

[17]  Hermann Ney,et al.  RASR - The RWTH Aachen University Open Source Speech Recognition Toolkit , 2011 .

[18]  Dimitri Palaz,et al.  Convolutional Neural Networks-based continuous speech recognition using raw speech signal , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  George Saon,et al.  Neural network acoustic models for the DARPA RATS program , 2013, INTERSPEECH.

[20]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.