Improvements to filterbank and delta learning within a deep neural network framework

Many features used in speech recognition tasks are hand-crafted and are not always related to the objective at hand, that is minimizing word error rate. Recently, we showed that replacing a perceptually motivated mel-filter bank with a filter bank layer that is learned jointly with the rest of a deep neural network was promising. In this paper, we extend filter learning to a speaker-adapted, state-of-the-art system. First, we incorporate delta learning into the filter learning framework. Second, we incorporate various speaker adaptation techniques, including VTLN warping and speaker identity features. On a 50-hour English Broadcast News task, we show that we can achieve a 5% relative improvement in word error rate (WER) using the filter and delta learning, compared to having a fixed set of filters and deltas. Furthermore, after speaker adaptation, we find that filter and delta learning allows for a 3% relative improvement in WER compared to a state-of-the-art CNN.

[1]  Stan Davis,et al.  Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se , 1980 .

[2]  Sarel van Vuuren,et al.  Data-driven design of RASTA-like filters , 1997, EUROSPEECH.

[3]  Alex Waibel,et al.  Vocal Tract Length Normalization for Large Vocabulary Continuous Speech Recognition , 1997 .

[4]  Tara N. Sainath,et al.  FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[5]  Li Lee,et al.  Speaker normalization using efficient frequency warping procedures , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[6]  Tara N. Sainath,et al.  Learning filter banks within a deep neural network framework , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[7]  Jeih-Weih Hung,et al.  Optimization of temporal filters for constructing robust features in speech recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[8]  Gerald Penn,et al.  Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Tara N. Sainath,et al.  Joint training of convolutional and non-convolutional neural networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Hynek Hermansky,et al.  RASTA processing of speech , 1994, IEEE Trans. Speech Audio Process..

[11]  Geoffrey E. Hinton,et al.  Understanding how Deep Belief Networks perform acoustic modelling , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Brian Kingsbury,et al.  Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[13]  George Saon,et al.  Speaker adaptation of neural network acoustic models using i-vectors , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[14]  Alain Biem,et al.  A discriminative filter bank model for speech recognition , 1995, EUROSPEECH.

[15]  Tara N. Sainath,et al.  Deep convolutional neural networks for LVCSR , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[16]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[17]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[18]  Brian Kingsbury,et al.  Boosted MMI for model and feature-space discriminative training , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[19]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[20]  Tara N. Sainath,et al.  Exemplar-Based Sparse Representation Features: From TIMIT to LVCSR , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[21]  Tara N. Sainath,et al.  Improvements to Deep Convolutional Neural Networks for LVCSR , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.