An Experimental Study on Confidence Measures for Robust Speech Recognition

A reliable confidence measure is one of the most critical components of a practical speech recognition system. In this paper, we report a series of experiments conducted to improve confidence measures for large-vocabulary, speaker-independent speech recognition. First, we studied the behavior of confidence measures for mispronounced words during the user enrollment phase. Acoustic features at the word, phoneme, and senone levels were examined. We developed a transformation-function-based system that uses sub-word features for high-performance confidence estimation, and used discriminative training to optimize the parameters of the transformation function. Compared with the baseline system, the proposed system reduced the equal error rate by 15% and the false acceptance error by 40% at a number of fixed false rejection rates. Second, we augmented our feature vectors for speech recognition error detection. With multi-dimensional features and a linear classifier, the false acceptance error was reduced by 80% compared with our single-feature baseline system. Finally, we investigated how confidence measures can be used to reject noise. With explicit noise modeling and a secondary classifier, we reduced the noise rejection error to 7%, a 68% error reduction over our baseline system.
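The error metrics quoted above (false acceptance, false rejection, and equal error rate) can be illustrated with a minimal sketch. It assumes each recognized word carries a scalar confidence score and a correct/incorrect label, and that a word is accepted when its score meets a threshold. The function names and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def error_rates(
    correct_scores: Sequence[float],
    incorrect_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (false_rejection_rate, false_acceptance_rate) at a threshold.

    A hypothesis is accepted when its confidence score is >= threshold.
    False rejection: a correctly recognized word is rejected.
    False acceptance: a misrecognized word is accepted.
    """
    fr = sum(s < threshold for s in correct_scores) / len(correct_scores)
    fa = sum(s >= threshold for s in incorrect_scores) / len(incorrect_scores)
    return fr, fa


def equal_error_rate(
    correct_scores: Sequence[float],
    incorrect_scores: Sequence[float],
) -> Tuple[float, float]:
    """Approximate the equal error rate by sweeping the observed scores.

    Returns (threshold, eer), where the threshold is the candidate with the
    smallest |FR - FA| and eer is the mean of FR and FA at that threshold.
    """
    candidates = sorted(set(correct_scores) | set(incorrect_scores))
    best_gap, best_threshold, best_eer = None, None, None
    for t in candidates:
        fr, fa = error_rates(correct_scores, incorrect_scores, t)
        gap = abs(fr - fa)
        if best_gap is None or gap < best_gap:
            best_gap, best_threshold, best_eer = gap, t, (fr + fa) / 2
    return best_threshold, best_eer


# Toy confidence scores (illustrative only, not experimental data).
correct = [0.9, 0.8, 0.7, 0.4]      # scores of correctly recognized words
incorrect = [0.6, 0.3, 0.2, 0.1]    # scores of misrecognized words

threshold, eer = equal_error_rate(correct, incorrect)
print(f"threshold={threshold}, EER={eer}")
```

Holding the false rejection rate fixed and comparing false acceptance rates between two systems, as the abstract describes, amounts to picking the threshold for each system that yields the target false rejection rate and then reading off the second value returned by `error_rates`.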
