Large Vocabulary Recognition of Wall Street Journal Sentences at Dragon Systems

In this paper we present some of the algorithm improvements that have been made to Dragon's continuous speech recognition and training programs, improvements that have more than halved our error rate on the Resource Management task since the last SLS meeting in February 1991. We also report the "dry run" results that we have obtained on the 5000-word speaker-dependent Wall Street Journal recognition task, and outline our overall research strategy and plans for the future.

In our system, a set of output distributions, known as the set of PELs (phonetic elements), is associated with each phoneme. The HMM for a PIC (phoneme-in-context) is represented as a linear sequence of states, each having an output distribution chosen from the set of PELs for the given phoneme, and a (double exponential) duration distribution.

In this paper we report on two methods of acoustic modeling and training. The first method involves generating a set of (unimodal) PELs for a given speaker by clustering the hypothetical frames found in the spectral models for that speaker, and then constructing speaker-dependent PEL sequences to represent each PIC. The "spectral model" for a PIC is simply the expected value of the sequence of frames that would be generated by the PIC. The second method represents the probability distribution for each parameter in a PEL as a mixture of a fixed set of unimodal components, the mixing weights being estimated using the EM algorithm. In both models we assume that the parameters are statistically independent.

We report results obtained using each of these two methods (RePELing/respelling and univariate "tied mixtures") on the 5000-word closed-vocabulary verbalized-punctuation version of the Wall Street Journal task.
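
The abstract does not spell out the tied-mixture re-estimation, so the sketch below is only a rough illustration of the second method: re-estimating the mixing weights of a univariate mixture over a fixed, shared set of unimodal components with EM. Treating the fixed components as Gaussians, and all function and variable names (gaussian_pdf, estimate_tied_mixture_weights, codebook_means, etc.), are our assumptions for the example, not details from the paper.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density, evaluated elementwise over x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def estimate_tied_mixture_weights(samples, means, variances, n_iters=20, tol=1e-6):
    """EM re-estimation of mixing weights for one parameter of one output
    distribution, with the component means and variances held fixed
    ("tied") across all distributions."""
    samples = np.asarray(samples, dtype=float)
    n_components = len(means)
    weights = np.full(n_components, 1.0 / n_components)  # uniform start

    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: posterior probability of each fixed component for each sample.
        densities = np.stack(
            [gaussian_pdf(samples, m, v) for m, v in zip(means, variances)],
            axis=1,
        )                                       # shape (n_samples, n_components)
        weighted = densities * weights          # broadcast weights over samples
        totals = weighted.sum(axis=1, keepdims=True)
        responsibilities = weighted / np.maximum(totals, 1e-300)

        # M-step: new weights are the average responsibilities.
        weights = responsibilities.mean(axis=0)

        # Stop once the data log-likelihood no longer improves appreciably.
        ll = np.log(np.maximum(totals, 1e-300)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    return weights

# Example: fit weights for one acoustic parameter against a fixed codebook
# of eight Gaussian components spanning the parameter's range.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(-1.0, 0.4, 300), rng.normal(2.0, 0.6, 200)])
    codebook_means = np.linspace(-3.0, 3.0, 8)
    codebook_vars = np.full(8, 0.5)
    print(np.round(estimate_tied_mixture_weights(data, codebook_means, codebook_vars), 3))
```

Because the paper assumes the parameters are statistically independent, a full PEL under this scheme would simply repeat this per-parameter weight estimation for each parameter stream and multiply the resulting univariate mixture likelihoods.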
