Finding consensus in speech recognition

This thesis explores new ways of using the information in word lattices produced by speech recognition systems to improve the accuracy of the recognition output and to obtain a more perspicuous representation of a set of alternative hypotheses. We change the standard problem formulation from a search among a large set of sentence hypotheses to a local search in a small set of word candidates. Our approach replaces sentence-level posterior probabilities with word-level posteriors as the objective function for speech recognition, which corresponds to the word-based error metric commonly used. The core of the method is a clustering procedure that identifies mutually supporting and competing word hypotheses in a lattice, constructing a total order over all word hypotheses. Together with word posterior probabilities computed from recognizer scores, this allows efficient extraction of the hypothesis that is expected to minimize the word error rate. Our approach thus overcomes the mismatch between the word-based performance metric and the sentence-based standard MAP scoring paradigm, a mismatch that can lead to sub-optimal recognition results. We also show that our method can be used as an efficient lattice compression technique. Its success comes from the ability to discard links with low a posteriori probability and recombine the remaining ones to create a new set of hypotheses. Experiments on the Switchboard corpus and Broadcast News show that this approach yields significant word error rate reductions, both over the standard MAP approach and over a previous word error minimization technique based on N-best lists. We also report a significant decrease in lattice size compared with the conventionally used technique. In essence, our method is an estimator of word posterior probabilities, and as such could benefit a number of other tasks, such as word spotting and confidence annotation.
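The extraction and compression steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: it assumes the clustering has already produced an ordered list of bins, each mapping competing word candidates to posterior probabilities, and it skips the clustering itself. The names `consensus_hypothesis`, `prune`, and `EPS`, and all of the numbers in `example_bins`, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical label for the "no word here" (deletion) candidate in a bin.
EPS = "<eps>"


def consensus_hypothesis(bins: List[Dict[str, float]]) -> List[str]:
    """Pick the highest-posterior candidate in each ordered bin.

    Because the bins are totally ordered, the global search reduces to
    an independent local choice per bin.
    """
    words = []
    for candidates in bins:
        best = max(candidates, key=candidates.get)
        if best != EPS:  # a winning deletion contributes no word
            words.append(best)
    return words


def prune(bins: List[Dict[str, float]], threshold: float) -> List[Dict[str, float]]:
    """Drop candidates whose posterior is below `threshold`.

    The best candidate of each bin is always kept so no bin becomes empty.
    """
    pruned = []
    for candidates in bins:
        best = max(candidates, key=candidates.get)
        kept = {w: p for w, p in candidates.items() if p >= threshold or w == best}
        pruned.append(kept)
    return pruned


# Illustrative posteriors only; these are made-up numbers, not real data.
example_bins = [
    {"i": 0.9, "a": 0.1},
    {"have": 0.55, "move": 0.45},
    {"it": 0.4, EPS: 0.6},
    {"very": 0.7, "veal": 0.3},
    {"often": 0.5, "fine": 0.3, "fast": 0.2},
]

print(consensus_hypothesis(example_bins))
print(prune(example_bins, 0.35))
```

In this sketch, the per-bin maximization plays the role of the local search over word candidates, and `prune` mimics discarding low-posterior links. The surviving candidates can still be recombined into sentence hypotheses that never appeared as complete paths in the original lattice.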
