Phone-duration-based confidence measures for embedded applications

To detect misrecognitions that may result from a mismatch between training and test data, we use a confidence measure (CM) that collects a set of features during recognition and from the N-best list output by the recognizer. A neural network (NN) then uses these features to estimate the probability that the utterance was recognized correctly. Because the phoneme alignments of misrecognized utterances are often erroneous, we introduce new features based on phoneme durations. The durations found by the recognizer are compared with the durations observed in the training database, and the results of these comparisons serve as input to the NN. A major advantage of the duration-related features is that they are independent of the recognizer, in contrast to, for example, features based on acoustic scores. We also use some score-related features that have proven useful in the past. While determining the confidence of a recognition result, we also try to detect whether a misrecognized utterance was an out-of-vocabulary (OOV) utterance. Using the complete set of 46 features, we achieve a correct classification rate of 90%. The word error rate can be reduced by 92% at a false rejection rate of 5.1% on a test task that comprises 35 speakers and includes more than 50% OOV utterances. OOV words were detected correctly in 91% of the cases. The presented CM is also used in a semi-supervised speaker adaptation scheme.
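
The comparison of recognized phoneme durations with training durations can be illustrated with a minimal Python sketch. It is not the feature set used in this work: the per-phoneme statistics, the function name `duration_features`, the z-score comparison, and the threshold `z_limit` are all assumptions chosen for illustration. The sketch only shows how an alignment with implausible durations could yield feature values that differ clearly from those of a plausible alignment.

```python
from statistics import mean

# Hypothetical per-phoneme duration statistics (in frames), as they might be
# estimated from forced alignments of the training data: (mean, std deviation).
TRAIN_STATS = {
    "a": (8.0, 2.0),
    "t": (5.0, 1.0),
    "s": (9.0, 3.0),
}


def duration_features(alignment, stats=TRAIN_STATS, z_limit=2.0):
    """Compute illustrative duration features for one utterance.

    alignment: list of (phoneme, duration_in_frames) pairs from the recognizer.
    Returns a dict of scalar features that could be fed to a classifier.
    """
    # Absolute deviation of each duration from its training mean,
    # measured in training standard deviations.
    z_scores = []
    for phoneme, duration in alignment:
        mu, sigma = stats[phoneme]
        z_scores.append(abs(duration - mu) / sigma)

    # Count phonemes whose duration is far from what training data suggests.
    outliers = sum(1 for z in z_scores if z > z_limit)

    return {
        "mean_abs_z": mean(z_scores),
        "max_abs_z": max(z_scores),
        "outlier_fraction": outliers / len(z_scores),
    }


# A plausible alignment versus one with strongly distorted durations.
plausible = duration_features([("s", 9), ("a", 8), ("t", 5)])
suspicious = duration_features([("s", 1), ("a", 20), ("t", 5)])
print(plausible)
print(suspicious)
```

In this sketch the distorted alignment produces a larger maximum deviation and a higher outlier fraction than the plausible one. A classifier such as the NN described above would combine values of this kind with the score-related features.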
