In the standard approach to speech recognition, the goal is to find the sentence hypothesis that maximizes the posterior probability of the word sequence given the acoustic observations. However, speech recognizers are usually evaluated by word error rate, so there is a mismatch between the training and evaluation criteria. Recently, algorithms that directly minimize the word error rate and other task-specific error criteria have been proposed. This paper presents an extension of the ROVER algorithm for combining the outputs of multiple speech recognizers using both a word error criterion and a sentence error criterion. The algorithm was evaluated on the 1998 and 1999 broadcast news evaluation test sets, as well as on the 10-hour SDR 1999 speech recognition subset, and consistently outperformed the standard ROVER algorithm. The approach appears particularly well suited to improving recognition performance when combining only two or three speech recognizers, achieving relative improvements of up to 20% over the best single recognizer.
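To illustrate the core idea behind ROVER-style system combination, the toy sketch below performs a per-slot majority vote over hypotheses that are assumed to be pre-aligned (the real ROVER algorithm first builds a word transition network via dynamic-programming alignment, which is omitted here); `vote_combine` and the empty-string deletion marker are illustrative choices, not part of the paper.

```python
from collections import Counter

def vote_combine(aligned_hyps):
    """Majority vote per alignment slot; '' denotes a deletion.

    aligned_hyps: list of equal-length token lists, one per recognizer,
    already aligned slot by slot (a simplifying assumption).
    """
    combined = []
    for slot in zip(*aligned_hyps):
        # Pick the most frequent word in this slot across all systems.
        word, _ = Counter(slot).most_common(1)[0]
        if word:  # skip slots where the winning vote is a deletion
            combined.append(word)
    return combined

# Three hypothetical recognizer outputs for the same utterance.
hyps = [
    ["the", "cat", "sat", ""],
    ["the", "cat", "sat", "down"],
    ["a",   "cat", "sat", "down"],
]
print(vote_combine(hyps))  # ['the', 'cat', 'sat', 'down']
```

Voting per word slot targets word error directly; the paper's extension additionally considers a sentence error criterion when scoring the combined output.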