Frame accuracy is a common and natural summary statistic in neural-network-based ASR. It is often used as an indication of the performance of the neural network probability estimator and in the stopping criterion during its training. Although it is considered an important factor for word recognition, frame accuracy, like many such summary statistics, is an incomplete and sometimes misleading indicator of performance on the overall task of word recognition. Many in the ASR community have seen instances where an improvement in acoustic posterior probability estimation yielded a disappointing effect on word recognition. We conducted experiments to illustrate some of the variability in word-recognition performance associated with frame accuracy. Our experiments attempt to shed light on some of the factors that might give rise to instances where frame accuracy and word error do not correlate. Some of the results confirm intuitive or commonly known trends.
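For readers unfamiliar with the statistic, the following is a minimal sketch of how frame accuracy is commonly computed, assuming each frame has a vector of class posteriors and a single reference label. The function name `frame_accuracy` and the toy data are hypothetical and are not taken from the experiments described here.

```python
def frame_accuracy(posteriors, labels):
    """Fraction of frames whose highest-posterior class equals the reference label.

    posteriors: list of per-frame lists of class probabilities (assumed layout).
    labels: list of reference class indices, one per frame.
    """
    if len(posteriors) != len(labels):
        raise ValueError("posteriors and labels must have the same number of frames")
    if not labels:
        raise ValueError("at least one frame is required")

    correct = 0
    for frame, reference in zip(posteriors, labels):
        # Index of the class with the largest posterior in this frame.
        predicted = max(range(len(frame)), key=frame.__getitem__)
        if predicted == reference:
            correct += 1
    return correct / len(labels)


# Toy data: four frames, three classes (illustrative values only).
toy_posteriors = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
]
toy_labels = [0, 1, 1, 0]

accuracy = frame_accuracy(toy_posteriors, toy_labels)
print(accuracy)
```

Because the statistic only records whether the top-scoring class is correct in each frame, it discards how confident the estimator was and where in the utterance the errors fall, which is one reason it can diverge from word error.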