Improving continuous speech recognition with automatic multiple pronunciation support

Conventional computer speech recognition systems perform recognition using models of speech acoustics and of the language of the recognition task. For all but trivial recognition tasks, the units modeled are sub-word units, typically phonemes. Recognizing words then requires a pronunciation dictionary (PD) that specifies how each word is pronounced in terms of the units modeled. Even if the acoustic modeling component is perfect, the recognizer will still be prone to misrecognition, most often because the speaker uses a pronunciation other than the one in the PD. The different pronunciation may arise because the speaker is a non-native speaker of the language being recognized or has ‘mispronounced’ the word, or from coarticulatory effects, recognizer errors in phoneme hypothesization, or any combination of these. One way to overcome these misrecognitions is to use a dynamic PD, one able to acquire new pronunciations for words as they are encountered and misrecognized. The thesis examines two questions. Can automated methods be found that produce reliable alternate pronunciations? If so, does augmenting a PD that originally contains only canonical pronunciations with these alternates improve recognizer performance? It shows that, even with simple methods, average reductions in word error rate of at least 45% are possible, including for speakers who are not native speakers of the recognition task language.
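The two ideas the abstract relies on, a PD that can acquire alternate pronunciations and word error rate as the measure of success, can be illustrated with a minimal Python sketch. It is not the thesis's method: the class `PronunciationDictionary`, its methods, the phoneme symbols, and the example word are all hypothetical, and the sketch does not cover how alternate pronunciations would be derived from misrecognized speech. Word error rate is computed in the usual way, as word-level edit distance divided by the number of reference words.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def word_error_rate(ref_words, hyp_words):
    """Word error rate: word-level edit distance over the reference length."""
    if not ref_words:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


class PronunciationDictionary:
    """Hypothetical dynamic PD: each word maps to a list of phoneme sequences."""

    def __init__(self, canonical):
        # Start with exactly one canonical pronunciation per word.
        self.entries = {word: [tuple(phones)] for word, phones in canonical.items()}

    def add_alternate(self, word, phones):
        """Add an alternate pronunciation; return False if it is already listed."""
        phones = tuple(phones)
        prons = self.entries.setdefault(word, [])
        if phones in prons:
            return False
        prons.append(phones)
        return True

    def pronunciations(self, word):
        """Return a copy of all pronunciations currently listed for a word."""
        return list(self.entries.get(word, []))


# Illustrative data only; the phoneme symbols are assumptions.
pd = PronunciationDictionary({"tomato": ["t", "ah", "m", "ey", "t", "ow"]})
pd.add_alternate("tomato", ["t", "ah", "m", "aa", "t", "ow"])
wer = word_error_rate(["i", "like", "tomato"], ["i", "like", "potato"])
```

In this sketch a recognizer would treat every listed sequence as a valid realization of the word, so comparing word error rate before and after augmentation is one way to check whether the added alternates actually help.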
