Joint estimation of confidence and error causes in speech recognition

Speech recognition errors are essentially unavoidable under the severe conditions of real-world use, so confidence estimation, which scores the reliability of a recognition result, plays a critical role in developing speech-recognition-based applications for deployment in the field. However, to build an application that provides a high-quality service, accurate confidence estimation alone is not enough: we also need to extract and exploit further supplementary information from the speech recognition engine. As a first step in this direction, this paper proposes a method, based on a discriminative model, that estimates the confidence of a recognition result while jointly detecting the causes of recognition errors. The confidence of a recognition result and the presence or absence of each error cause are naturally correlated. By directly capturing these correlations, the proposed method improves its estimates of the confidence and of each error cause in a complementary manner. In initial speech recognition experiments, the proposed method achieved higher confidence estimation accuracy than a state-of-the-art confidence estimation method based on a discriminative model. Detailed analyses also confirmed that the estimation mechanism of the proposed method works effectively.
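
The idea of joint estimation can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a softmax (maximum-entropy style) model over joint labels that pair a correct/incorrect flag with one error cause, and it recovers the confidence and the per-cause probabilities by marginalizing the joint posterior. The cause names (`none`, `oov`, `noise`), the two-dimensional feature vector, and all weight values are hypothetical and chosen only for illustration.

```python
import math
from itertools import product

# Hypothetical error-cause inventory; the abstract does not list the actual causes.
CAUSES = ("none", "oov", "noise")
# Each joint label pairs a correctness flag with one cause.
JOINT_LABELS = list(product((True, False), CAUSES))


def joint_posterior(features, weights):
    """Softmax distribution over joint (is_correct, cause) labels."""
    scores = {
        label: sum(w * f for w, f in zip(weights[label], features))
        for label in JOINT_LABELS
    }
    top = max(scores.values())  # subtract the max score for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def confidence(posterior):
    """Marginal probability that the recognition result is correct."""
    return sum(p for (is_correct, _), p in posterior.items() if is_correct)


def cause_probabilities(posterior):
    """Marginal probability of each error cause."""
    return {
        c: sum(p for (_, cause), p in posterior.items() if cause == c)
        for c in CAUSES
    }


# Made-up features (e.g. an acoustic score and a language-model score)
# and made-up weights, one weight vector per joint label.
example_features = [0.8, -0.3]
example_weights = {label: [0.0, 0.0] for label in JOINT_LABELS}
example_weights[(True, "none")] = [2.0, 0.5]
example_weights[(False, "oov")] = [-1.0, 1.5]
example_weights[(False, "noise")] = [-1.5, -0.5]

example_posterior = joint_posterior(example_features, example_weights)
example_confidence = confidence(example_posterior)
example_causes = cause_probabilities(example_posterior)
```

Because a single distribution covers both variables, any evidence that raises the probability of an error cause also lowers the marginal confidence, which is the kind of correlation the abstract describes. In practice the weights would be trained on labeled recognition results rather than set by hand.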
