Flat start training of CD-CTC-sMBR LSTM RNN acoustic models

We present a recipe for training acoustic models with context-dependent (CD) phones from scratch using recurrent neural networks (RNNs). First, we use the connectionist temporal classification (CTC) technique to train a model with context-independent (CI) phones directly from written-domain word transcripts by aligning against all possible phonetic verbalizations. We then devise a mechanism to generate a set of CD phones from the CTC CI phone model alignments and train a CD phone model to improve accuracy. This end-to-end training recipe does not require any previously trained GMM-HMM or DNN model for CD phone generation or alignment, and thus drastically reduces the overall model building time. We show that this procedure does not degrade model performance and lets us refresh models more quickly when pronunciations or training data are updated.
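
To make the first stage of the recipe concrete, the sketch below shows a minimal CTC training step for a small LSTM over CI phone labels, written in PyTorch. It is an illustration only: the feature dimension, phone inventory, network size, and optimizer are assumptions; the toy targets stand in for a single phonetic verbalization of the transcript (whereas the recipe aligns against all possible verbalizations); and the distributed training, CD phone generation, and sMBR stages of the full system are not shown.

    # Minimal sketch of CTC flat-start training of a CI-phone LSTM (PyTorch).
    # Sizes and hyperparameters below are illustrative assumptions, not the
    # paper's actual configuration.
    import torch
    import torch.nn as nn

    NUM_CI_PHONES = 44   # hypothetical context-independent phone inventory
    FEATURE_DIM = 40     # e.g. log-mel filterbank features (assumption)
    HIDDEN_DIM = 256

    class CTCAcousticModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.lstm = nn.LSTM(FEATURE_DIM, HIDDEN_DIM,
                                num_layers=2, batch_first=True)
            # One extra output for the CTC blank label (index 0 here).
            self.proj = nn.Linear(HIDDEN_DIM, NUM_CI_PHONES + 1)

        def forward(self, feats):                    # feats: (batch, time, dim)
            h, _ = self.lstm(feats)
            return self.proj(h).log_softmax(dim=-1)  # (batch, time, labels)

    model = CTCAcousticModel()
    ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Toy batch: random features and random CI-phone targets stand in for
    # real utterances and for one phonetic verbalization of the transcript.
    feats = torch.randn(4, 200, FEATURE_DIM)
    input_lengths = torch.full((4,), 200, dtype=torch.long)
    target_lengths = torch.randint(10, 30, (4,))
    targets = torch.randint(1, NUM_CI_PHONES + 1, (int(target_lengths.sum()),))

    optimizer.zero_grad()
    log_probs = model(feats).transpose(0, 1)   # CTCLoss expects (time, batch, labels)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()
    optimizer.step()

In the full recipe, the alignments produced by this CI phone model would then be used to derive the CD phone inventory and retrain the network with CD phone targets before sequence-discriminative (sMBR) training.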
