Frame-Synchronous and Local Confidence Measures for Automatic Speech Recognition

In this paper, we introduce two new kinds of confidence measures for large-vocabulary speech recognition systems: frame-synchronous and local. Their main feature is that they can be computed without waiting for the end of the audio stream. The frame-synchronous measures are based on a likelihood ratio and can be computed as soon as a frame is processed by the recognition engine. The local measures estimate a local posterior probability in the vicinity of the word under analysis. We evaluated our confidence measures on the automatic transcription of French broadcast news, using the equal error rate (EER) criterion. Our local measures achieved results very close to those of the best state-of-the-art measure (EER of 23% compared to 22.0%). We then conducted a preliminary experiment to assess how much our confidence measure improves the comprehension of an automatic transcription for the hearing impaired. We introduced several modalities for highlighting low-confidence words in the transcription and showed that these modalities, used with our local confidence measure, improved comprehension of the automatic transcription.
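
To make the EER criterion concrete, the following is a minimal sketch of how an equal error rate could be computed for word-level confidence scores. It is not the evaluation code used in the paper: the function name `equal_error_rate`, the convention that a word is accepted when its score is at or above the threshold, and the toy scores are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def equal_error_rate(scores: Sequence[float],
                     is_correct: Sequence[bool]) -> Tuple[float, float]:
    """Return (eer, threshold) for word confidence scores.

    Assumed convention: a word is accepted when score >= threshold.
    False acceptance: an incorrectly recognized word is accepted.
    False rejection: a correctly recognized word is rejected.
    """
    if len(scores) != len(is_correct):
        raise ValueError("scores and is_correct must have the same length")
    n_correct = sum(1 for c in is_correct if c)
    n_incorrect = len(is_correct) - n_correct
    if n_correct == 0 or n_incorrect == 0:
        raise ValueError("need at least one correct and one incorrect word")

    best_gap = None
    best_eer = 0.0
    best_threshold = 0.0
    # Try every observed score as a candidate threshold and keep the one
    # where the two error rates are closest to each other.
    for t in sorted(set(scores)):
        false_accepts = sum(1 for s, c in zip(scores, is_correct)
                            if s >= t and not c)
        false_rejects = sum(1 for s, c in zip(scores, is_correct)
                            if s < t and c)
        far = false_accepts / n_incorrect
        frr = false_rejects / n_correct
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2  # midpoint when the rates differ
            best_threshold = t
    return best_eer, best_threshold


# Toy data (hypothetical): confidence score and correctness of eight words.
toy_scores: List[float] = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels: List[bool] = [True, True, False, True, True, False, False, False]
eer, threshold = equal_error_rate(toy_scores, toy_labels)
```

On this toy data the two error rates meet at a threshold of 0.6, giving an EER of 0.25. A lower EER indicates a confidence measure that separates correct from incorrect words more cleanly.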
