Size matters: an empirical study of neural network training for large vocabulary continuous speech recognition

We have trained and tested a number of large neural networks for emission probability estimation in large vocabulary continuous speech recognition. In particular, the problem under test is the DARPA Broadcast News task. Our goal here was to determine the relationship between training time, word error rate, size of the training set, and size of the neural network. In all cases, the network architecture was quite simple, comprising a single large hidden layer with an input window consisting of feature vectors from 9 frames around the current time, with a single output for each of 54 phonetic categories. Thus far, simultaneous increases in the size of the training set and the neural network improve performance; in other words, more data helps, as does the training of more parameters. We continue to be surprised that such a simple system works as well as it does for complex tasks. Given a limitation in training time, however, there appears to be an optimal ratio of training patterns to parameters of around 25:1 in these circumstances. Additionally, doubling the training data and system size appears to provide diminishing returns in error rate reduction for the largest systems.
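To make the 25:1 figure concrete, the following minimal sketch counts the parameters of a single-hidden-layer network with a 9-frame input window and 54 outputs, and estimates the hidden-layer size that would give roughly 25 training patterns per parameter. The per-frame feature dimension is not stated above, so `feature_dim` is an assumption the caller must supply. The function names and the inclusion of bias terms in the count are also illustrative choices, not details taken from the paper.

```python
def mlp_param_count(feature_dim, hidden_units, context_frames=9, n_outputs=54):
    """Count weights and biases of a single-hidden-layer MLP.

    feature_dim is an assumed per-frame feature size (not given in the text).
    context_frames=9 and n_outputs=54 follow the description above.
    """
    n_inputs = feature_dim * context_frames
    input_to_hidden = (n_inputs + 1) * hidden_units   # +1 for hidden biases
    hidden_to_output = (hidden_units + 1) * n_outputs  # +1 for output biases
    return input_to_hidden + hidden_to_output


def hidden_units_for_ratio(n_patterns, feature_dim, ratio=25.0,
                           context_frames=9, n_outputs=54):
    """Estimate the hidden-layer size giving about `ratio` patterns per parameter.

    Solves params = hidden * (n_inputs + 1 + n_outputs) + n_outputs for hidden,
    with params = n_patterns / ratio.
    """
    n_inputs = feature_dim * context_frames
    target_params = n_patterns / ratio
    hidden = (target_params - n_outputs) / (n_inputs + 1 + n_outputs)
    return max(1, round(hidden))


if __name__ == "__main__":
    # Hypothetical example: 13 features per frame, 4.3 million training frames.
    feature_dim = 13
    n_patterns = 4_300_000
    hidden = hidden_units_for_ratio(n_patterns, feature_dim)
    params = mlp_param_count(feature_dim, hidden)
    print(f"hidden units: {hidden}, parameters: {params}, "
          f"patterns per parameter: {n_patterns / params:.1f}")
```

The numbers in the example run are placeholders chosen only to exercise the arithmetic; they are not the training set sizes or network sizes used in the study.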
