Deep Scattering Spectrum with deep neural networks

State-of-the-art convolutional neural networks (CNNs) typically use a log-mel spectral representation of the speech signal. However, this representation is limited by the spectro-temporal resolution afforded by log-mel filter-banks. A novel technique known as Deep Scattering Spectrum (DSS) addresses this limitation and preserves higher resolution information, while ensuring time warp stability, through the cascaded application of the wavelet-modulus operator. The first order scatter is equivalent to log-mel features and standard CNN modeling techniques can directly be used with these features. However the higher order scatter, which preserves the higher resolution information, presents new challenges in modelling. This paper explores how to effectively use DSS features with CNN acoustic models. Specifically, we identify the effective normalization, neural network topology and regularization techniques to effectively model higher order scatter. The use of these higher order scatter features, in conjunction with CNNs, results in relative improvement of 7% compared to log-mel features on TIMIT, providing a phonetic error rate (PER) of 17.4%, one of the lowest reported PERs to date on this task.

[1]  Stéphane Mallat,et al.  Group Invariant Scattering , 2011, ArXiv.

[2]  S. Mallat,et al.  Invariant Scattering Convolution Networks , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Daniel P. W. Ellis,et al.  Autoregressive Modeling of Temporal Envelopes , 2007, IEEE Transactions on Signal Processing.

[4]  Tara N. Sainath,et al.  Improvements to Deep Convolutional Neural Networks for LVCSR , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[5]  Stan Davis,et al.  Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se , 1980 .

[6]  Hynek Hermansky,et al.  Multi-resolution RASTA filtering for TANDEM-based ASR , 2005, INTERSPEECH.

[7]  Klaus-Robert Müller,et al.  Efficient BackProp , 2012, Neural Networks: Tricks of the Trade.

[8]  Steven Greenberg,et al.  The modulation spectrogram: in pursuit of an invariant representation of speech , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[9]  Adi Shraibman,et al.  Rank, Trace-Norm and Max-Norm , 2005, COLT.

[10]  László Tóth Convolutional deep rectifier neural nets for phone recognition , 2013, INTERSPEECH.

[11]  Tara N. Sainath,et al.  FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[12]  Sridhar Krishna Nemala,et al.  A Multistream Feature Framework Based on Bandpass Modulation Filtering for Robust Speech Recognition , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[13]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[14]  Daniel P. W. Ellis,et al.  Frequency-domain linear prediction for temporal features , 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721).

[15]  Hynek Hermansky,et al.  RASTA processing of speech , 1994, IEEE Trans. Speech Audio Process..

[16]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[17]  Tara N. Sainath,et al.  Improving deep neural networks for LVCSR using rectified linear units and dropout , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[18]  Tara N. Sainath,et al.  Deep convolutional neural networks for LVCSR , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[19]  Dimitri Palaz,et al.  Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks , 2013, INTERSPEECH.

[20]  Stéphane Mallat Deep Learning by Scattering , 2013, ArXiv.

[21]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[22]  Tara N. Sainath,et al.  Exemplar-Based Sparse Representation Features: From TIMIT to LVCSR , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[23]  Joakim Andén,et al.  Deep Scattering Spectrum , 2013, IEEE Transactions on Signal Processing.

[24]  Tara N. Sainath,et al.  Learning filter banks within a deep neural network framework , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[25]  Gerald Penn,et al.  Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26]  Tara N. Sainath,et al.  Joint training of convolutional and non-convolutional neural networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).