Context dependent phone models for LSTM RNN acoustic modelling

Long Short-Term Memory Recurrent Neural Networks (LSTM RNNs), combined with hidden Markov models (HMMs), have recently been shown to outperform other acoustic models such as Gaussian mixture models (GMMs) and deep neural networks (DNNs) for large-scale speech recognition. We argue that using multi-state HMMs with LSTM RNN acoustic models is an unnecessary vestige of GMM-HMM and DNN-HMM modelling, since LSTM RNNs can predict output distributions through continuous, rather than piece-wise stationary, modelling of the acoustic trajectory. We demonstrate that context independent whole-phone models give results equivalent to 3-state models, and show that minimum-duration modelling can lead to further improvements. We go on to show that, given a minimum duration model, context dependent whole-phone models can perform as well as context dependent states.
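The two ideas in the abstract, replacing a phone's three HMM sub-states with a single whole-phone output predicted by an LSTM, and enforcing a minimum duration by repeating that single tied state in the topology, can be illustrated with a short sketch. The code below is a hypothetical illustration, not the authors' implementation: it uses PyTorch's nn.LSTM to map acoustic frames to whole-phone posteriors, and builds a left-to-right topology in which one shared output distribution is repeated min_dur times so a phone must occupy at least min_dur frames. All names here (PhoneLSTM, min_duration_topology, the feature and phone-set sizes) are assumptions chosen for the example.

```python
# Minimal sketch (assumed, not the paper's implementation) of:
#  (1) an LSTM acoustic model emitting whole-phone posteriors, and
#  (2) a minimum-duration HMM topology that repeats a single tied state.
import torch
import torch.nn as nn


class PhoneLSTM(nn.Module):
    """Unidirectional LSTM mapping acoustic frames to whole-phone posteriors."""

    def __init__(self, feat_dim: int, hidden: int, num_phones: int):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        # One output per whole phone: no 3-state sub-phone targets.
        self.proj = nn.Linear(hidden, num_phones)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim) -> (batch, time, num_phones) log-posteriors
        h, _ = self.lstm(frames)
        return torch.log_softmax(self.proj(h), dim=-1)


def min_duration_topology(phone_id: int, min_dur: int):
    """Left-to-right topology for one phone: min_dur copies of the SAME tied
    output distribution, so the decoder must stay in the phone for at least
    min_dur frames. Returns (emitting_state_labels, transition_arcs), where
    arcs are (from_state, to_state) pairs and the last state self-loops."""
    states = [phone_id] * min_dur               # every copy shares one LSTM output
    arcs = [(i, i + 1) for i in range(min_dur - 1)]
    arcs.append((min_dur - 1, min_dur - 1))     # self-loop allows durations > min_dur
    return states, arcs


if __name__ == "__main__":
    model = PhoneLSTM(feat_dim=40, hidden=256, num_phones=42)
    logp = model(torch.randn(1, 100, 40))       # 100 frames of 40-dim features
    print(logp.shape)                           # torch.Size([1, 100, 42])
    print(min_duration_topology(phone_id=7, min_dur=3))
```

With min_dur set to 1 the topology reduces to an ordinary single-state whole-phone unit; larger values impose the minimum-duration constraint without adding any extra output distributions for the LSTM to learn.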
