A comparison of audio-free speech recognition error prediction methods

Predicting likely speech recognition errors is valuable for a number of Automatic Speech Recognition (ASR) applications. In this study, we extend a Weighted Finite State Transducer (WFST) framework for error prediction to compare two approaches to predicting confusable words: learning phone confusions from recognition errors on the training set, and using distances between the phonetic acoustic models. We also expand the framework to handle continuous word recognition. Without using any acoustic information from the unseen test data, we accurately predict 60% of the misrecognized sentences (with an average of 15 words per sentence) and a little over 70% of the total number of errors.

Index Terms: Finite State Transducer, Automatic Speech Recognition, Error prediction
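To make the first of the two approaches more concrete, the following is a minimal sketch of how phone confusion probabilities might be estimated from recognition errors on a training set. It is not the WFST framework described here: it assumes the reference and recognized phone sequences are already available as Python lists, uses a plain Levenshtein alignment in place of any transducer composition, and the function names `align` and `confusion_probs` are hypothetical.

```python
from collections import Counter, defaultdict


def align(ref, hyp):
    """Levenshtein-align two phone sequences.

    Returns a list of (ref_phone, hyp_phone) pairs; None marks a gap
    (a deletion when hyp_phone is None, an insertion when ref_phone is None).
    """
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def confusion_probs(sequence_pairs):
    """Estimate P(recognized phone | reference phone) from aligned training pairs.

    Insertions (no reference phone) are ignored in this simplified sketch.
    """
    counts = defaultdict(Counter)
    for ref, hyp in sequence_pairs:
        for r, h in align(ref, hyp):
            if r is not None:
                counts[r][h] += 1
    return {
        r: {h: c / sum(cs.values()) for h, c in cs.items()}
        for r, cs in counts.items()
    }


# Toy, made-up training data: the same word recognized correctly once
# and once with /k/ confused as /g/.
training = [
    (["k", "ae", "t"], ["k", "ae", "t"]),
    (["k", "ae", "t"], ["g", "ae", "t"]),
]
probs = confusion_probs(training)
```

In a WFST setting, probabilities of this kind would typically become the arc weights of a phone confusion transducer. The second approach would instead derive such weights from distances between acoustic models, which this sketch does not attempt.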
