Utterance verification based on statistics of phone-level confidence scores

We present new acoustic confidence scores for utterance verification based on novel combinations of phone-level posterior probability statistics. A common utterance-level acoustic confidence score in the literature is the arithmetic mean, computed over the utterance, of the phone log posterior probabilities. This approach is problematic when most of the utterance is in-grammar (IG) but a small part is out-of-grammar (OOG). For example, a caller says the OOG name "Larry", which is misrecognized as the IG name "Harry". Because most phones are correctly recognized, the mean of the phone posteriors yields a high utterance-level score even though the recognition result should be rejected. We introduce additional statistics, such as the variance and low percentile points of the phone posterior scores over the utterance, that help capture localized deviations in otherwise good recognition matches. We report experiments on combining these statistics. In particular, normalizing the mean by the standard deviation yields a 10-20% relative improvement in performance on alpha-digit test sets, where OOG utterances are often misrecognized as very similar IG ones.
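The statistics named above can be illustrated with a minimal sketch. The exact form of the normalization is not given in this abstract, so the `penalized` score below (mean minus a weighted standard deviation) is only an assumed stand-in for one way of letting the standard deviation lower the mean. The function name, the `std_weight` and `low_percentile` parameters, and the phone scores are all hypothetical values chosen for illustration, not data from the experiments.

```python
import math
import statistics


def utterance_statistics(phone_log_posteriors, low_percentile=10.0, std_weight=1.0):
    """Summarize phone-level log posteriors over one utterance.

    Assumption: the 'penalized' score is an illustrative combination,
    not the normalization used in the paper.
    """
    scores = sorted(phone_log_posteriors)
    if not scores:
        raise ValueError("at least one phone score is required")

    mean = statistics.fmean(scores)
    # Population standard deviation over the phones of this utterance.
    std = statistics.pstdev(scores)

    # Nearest-rank percentile: the smallest score such that at least
    # low_percentile percent of the phone scores are <= it.
    rank = max(1, math.ceil(low_percentile * len(scores) / 100.0))
    low_point = scores[rank - 1]

    return {
        "mean": mean,
        "std": std,
        "low_percentile": low_point,
        # Hypothetical combination: a large spread lowers the score.
        "penalized": mean - std_weight * std,
    }


# Hypothetical ten-phone utterances (log posteriors; 0.0 is a perfect match).
clean = utterance_statistics([-0.05] * 10)
# Same utterance with one badly matching phone, as in a "Larry"/"Harry" confusion.
confusable = utterance_statistics([-2.0] + [-0.05] * 9)
```

In this toy case the single poor phone lowers the mean only moderately, while the standard deviation and the low percentile point move sharply, which is the behaviour the additional statistics are meant to expose.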
