Multiple-pronunciation lexical modeling in a speaker independent speech understanding system

Over the past 40 years, significant progress has been made in the fields of speech recognition and speech understanding. Current state-of-the-art speech recognition systems achieve word-level accuracies of 90% to 95% on continuous speech recognition tasks with 5,000-word vocabularies. Even larger systems, capable of recognizing 20,000 words, are just now being developed. Speech understanding systems have recently been developed that perform fairly well within a restricted domain. While the size and performance of modern speech recognition and understanding systems are impressive, it is evident to anyone who has used them that the technology is primitive compared to the human ability to understand speech.

Some of the difficulties hampering progress in these fields stem from the many sources of variation in human communication. One such source is the different ways in which words can be pronounced. Pronunciation variation has many causes, including the phonetic environment in which the word occurs, the speaker's dialect, age, and gender, and the speaking rate. Some researchers have shown improvements in speech recognition performance on a read-speech task through explicit pronunciation modeling, while others have not shown any significant improvement.

This thesis presents an algorithm for constructing models that attempt to capture the variation in the pronunciations of words in spontaneous (i.e., non-read) speech. A technique is presented for developing alternate pronunciations of words and then estimating the probabilities of those pronunciations. Additionally, we describe the development and implementation of a spoken-language understanding system called the Berkeley Restaurant Project (BeRP). Multiple-pronunciation word models constructed using the proposed algorithm are evaluated within the context of the BeRP system. The results of this evaluation show that explicitly modeling variation in the pronunciation of words improves the performance of both the speech recognition and the speech understanding components of the BeRP system.
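
To make the idea of a multiple-pronunciation lexicon concrete, the following is a minimal sketch of one simple way to attach probabilities to alternate pronunciations: relative-frequency estimation over observed (word, phone sequence) pairs. This is not the algorithm developed in the thesis, which the abstract does not specify. The function name, the phone symbols, the optional additive smoothing, and the toy observations are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def estimate_pronunciation_probs(observations, smoothing=0.0):
    """Estimate P(pronunciation | word) by relative frequency.

    observations: iterable of (word, phones) pairs, where phones is a
    sequence of phone symbols. smoothing is an optional additive constant
    (an assumption of this sketch, not taken from the source).
    """
    counts = defaultdict(Counter)
    for word, phones in observations:
        counts[word][tuple(phones)] += 1

    lexicon = {}
    for word, variants in counts.items():
        # Normalize so the probabilities of each word's variants sum to 1.
        total = sum(variants.values()) + smoothing * len(variants)
        lexicon[word] = {
            phones: (count + smoothing) / total
            for phones, count in variants.items()
        }
    return lexicon


# Hypothetical observations, e.g. from phonetically aligned utterances.
observations = [
    ("and", ("ae", "n", "d")),
    ("and", ("ae", "n")),
    ("and", ("ax", "n")),
    ("and", ("ae", "n")),
    ("the", ("dh", "ax")),
    ("the", ("dh", "iy")),
    ("the", ("dh", "ax")),
    ("the", ("dh", "ax")),
]

lexicon = estimate_pronunciation_probs(observations)

for word, variants in lexicon.items():
    for phones, prob in sorted(variants.items(), key=lambda kv: -kv[1]):
        print(f"{word:>4}  {' '.join(phones):<8}  {prob:.2f}")
```

In a recognizer, such per-word distributions would typically weight the alternate paths through a word's pronunciation network; how the candidate pronunciations are generated in the first place is a separate step that this sketch does not attempt.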
