Adaptive statistical language modeling

The trigram statistical language model is remarkably successful in applications such as speech recognition. However, the trigram model is static: it considers only the previous two words when predicting the next word. The work presented here attempts to improve upon the trigram model by incorporating additional contextual and longer-distance information. This is frequently referred to in the literature as adaptive statistical language modeling, because the model is thought of as adapting to the longer-term information. This work considers the creation of topic-specific models; statistical evidence from the presence or absence of triggers, or related words, in the document history (document triggers) and in the current sentence (in-sentence triggers); and the incorporation of a document cache, which predicts the probability of a word from its frequency in the document history. An important result of this work is that self-triggers, that is, whether or not the word itself occurred in the document history, are an extremely informative source of evidence. A maximum entropy (ME) approach is used in many instances to combine information from different sources. Maximum entropy selects the model that maximizes entropy while satisfying the constraints imposed by the information we wish to incorporate; the generalized iterative scaling (GIS) algorithm can be used to compute the maximum entropy solution. This work also considers various methods of smoothing the information in a maximum entropy model. A further important result is that smoothing improves performance noticeably and that Good-Turing discounting is an effective method of smoothing.

Thesis Supervisor: Victor Zue
Title: Principal Research Scientist, Department of Electrical Engineering and Computer Science
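The maximum entropy idea sketched above can be made concrete on a toy scale: generalized iterative scaling adjusts feature weights until the model's expected feature values match the empirical expectations observed in the data, yielding the maximum entropy distribution consistent with those constraints. The following is a minimal illustrative sketch, not the thesis's actual system; the function name `train_gis` and the slack-feature construction are assumptions for this example, and a real language model would use word-and-history features over a large vocabulary.

```python
import math

def train_gis(outcomes, features, data, iterations=200):
    """Fit a maximum entropy distribution p(x) proportional to
    exp(sum_i w_i * f_i(x)) over a finite outcome set, so that model
    feature expectations match the empirical expectations in `data`
    (a list of observed outcomes). Uses generalized iterative scaling."""
    # GIS assumes the feature values sum to a constant for every outcome;
    # the standard fix is to append a "slack" feature that tops each sum up to C.
    C = max(sum(f(x) for f in features) for x in outcomes)
    feats = features + [lambda x: C - sum(f(x) for f in features)]

    # Empirical expectation of each feature under the observed data.
    n = len(data)
    emp = [sum(f(x) for x in data) / n for f in feats]
    w = [0.0] * len(feats)

    def dist():
        # Normalized exponential model under the current weights.
        scores = [math.exp(sum(wi * f(x) for wi, f in zip(w, feats)))
                  for x in outcomes]
        z = sum(scores)
        return [s / z for s in scores]

    for _ in range(iterations):
        p = dist()
        # Expectation of each feature under the current model.
        mod = [sum(pi * f(x) for pi, x in zip(p, outcomes)) for f in feats]
        # Log-additive GIS update: raise weights of under-predicted features,
        # lower weights of over-predicted ones.
        for i in range(len(w)):
            if emp[i] > 0:
                w[i] += math.log(emp[i] / mod[i]) / C
    return dict(zip(outcomes, dist()))
```

For instance, with outcomes `{'a', 'b', 'c'}`, a single indicator feature for `'a'`, and data in which `'a'` occurs half the time, the constraint forces p('a') = 0.5, and maximizing entropy spreads the remainder uniformly, so p('b') = p('c') = 0.25.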
