When adapting an existing ASR application to different user environments, one is often confronted with speech that does not entirely match the training conditions. Differences may stem from both acoustic and linguistic causes. In this paper we explore to what extent the word correct rate (WCR) for a given test set can be predicted from the transcription alone (i.e. the linguistic representation), under the assumption that acoustic conditions are matched. We hope that, eventually, such a prediction can provide an estimate of a lower bound on the word error rate (WER) to aim for when applying acoustic enhancement techniques. We propose and compute measures of acoustic and linguistic confusability (AC and LC) for each entry in the vocabulary of an ASR engine. Using a tabulation of how the correctness of actual recognition on a development set varies as a function of these confusability measures, we show that the actually observed WCR of words from independent test sets can be predicted with high accuracy over the full range of AC and LC levels.
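A minimal sketch of how such a tabulation-based predictor could be realized is given below. The function name, the NumPy-based binning of AC and LC scores into deciles, and the fallback value for empty cells are all illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np

def predict_wcr(dev_ac, dev_lc, dev_correct, test_ac, test_lc, n_bins=10):
    """Predict a test set's word correct rate (WCR) from per-word
    acoustic (AC) and linguistic (LC) confusability scores, using a
    tabulation of recognition correctness on a development set.
    All names and binning choices here are illustrative assumptions."""
    dev_ac = np.asarray(dev_ac, dtype=float)
    dev_lc = np.asarray(dev_lc, dtype=float)
    dev_correct = np.asarray(dev_correct, dtype=float)  # 1 = correct, 0 = error

    # Decile bin edges taken from the development-set score distributions.
    ac_edges = np.quantile(dev_ac, np.linspace(0, 1, n_bins + 1))
    lc_edges = np.quantile(dev_lc, np.linspace(0, 1, n_bins + 1))

    def cell(ac, lc):
        # Map each (AC, LC) pair to a bin index pair in [0, n_bins).
        i = np.clip(np.digitize(ac, ac_edges[1:-1]), 0, n_bins - 1)
        j = np.clip(np.digitize(lc, lc_edges[1:-1]), 0, n_bins - 1)
        return i, j

    # Tabulate the observed correct rate in each (AC, LC) cell.
    hits = np.zeros((n_bins, n_bins))
    counts = np.zeros((n_bins, n_bins))
    i, j = cell(dev_ac, dev_lc)
    np.add.at(hits, (i, j), dev_correct)
    np.add.at(counts, (i, j), 1.0)
    fallback = dev_correct.mean()  # assumed fallback for cells with no dev data
    rate = np.where(counts > 0, hits / np.maximum(counts, 1.0), fallback)

    # Predicted WCR for the test set: average the per-cell development-set
    # correct rates over the (AC, LC) cells of the test-set words.
    i, j = cell(np.asarray(test_ac, dtype=float), np.asarray(test_lc, dtype=float))
    return float(rate[i, j].mean())
```

Averaging per-cell rates over the test words mirrors the paper's idea that prediction should hold over the full range of AC and LC levels, not just in aggregate: each test word contributes the correct rate observed for dev-set words of comparable confusability.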