A framework for predicting speech recognition errors

Pronunciation modeling in automatic speech recognition (ASR) systems has had mixed results in the past; one likely reason for poor performance is that adding new pronunciation variants increases confusability in the lexicon. In this work, we propose a new framework for determining lexically confusable words based on inverted finite state transducers (FSTs); we also present experiments designed to test some of the implementation details of this framework. The method is evaluated by examining how well it predicts the errors made by an ASR system. The model is able to generalize confusions learned from a training set to predict errors made by the speech recognizer on an unseen test set.
