USING AUDIO QUALITY TO PREDICT WORD ERROR RATE IN AN AUTOMATIC SPEECH RECOGNITION SYSTEM

Faced with a backlog of audio recordings, users of automatic speech recognition (ASR) systems would benefit from the ability to predict which files will yield useful output transcripts, so that processing resources can be prioritized. ASR systems used in non-research environments typically run in “real time”: one hour of speech requires one hour of processing. These systems produce transcripts with widely varying Word Error Rates (WER), depending on both the words actually spoken and the quality of the recording. Known correlations between WER and performance on downstream tasks such as information retrieval and machine translation could be exploited if WER could be predicted before an audio file is processed. We describe here a method for estimating the quality of an ASR output transcript by predicting the portion of its total WER attributable to the quality of the audio recording.
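As background for the metric discussed above: WER is the word-level edit distance between a reference transcript and an ASR hypothesis, normalized by the number of reference words. A minimal illustrative sketch follows (the paper's own prediction model is not reproduced here; this only shows how WER itself is scored):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words,
    computed as a word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why transcripts of poor-quality audio can be effectively unusable for downstream tasks.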
