A hierarchical Dirichlet language model

We discuss a hierarchical probabilistic model whose predictions are similar to those of the popular language modelling procedure known as 'smoothing'. A number of interesting differences from smoothing emerge. The insights gained from a probabilistic view of this problem point towards new directions for language modelling. The ideas of this paper are also applicable to other problems such as the modelling of triphones in speech, and DNA and protein sequences in molecular biology. The new algorithm is compared with smoothing on a two-million-word corpus. The methods prove to be about equally accurate, with the hierarchical model using fewer computational resources.
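To make the resemblance to smoothing concrete, here is a minimal Python sketch, not the paper's algorithm. It relies on a standard result: if the next-word distribution in each context has a Dirichlet prior with concentration `alpha` and mean vector `m`, the predictive probability of a word is `(count + alpha * m[word]) / (total + alpha)`. The function names, the toy corpus, the uniform choice of `m`, and the fixed value of `alpha` are all assumptions made for illustration; the sketch does not attempt to infer these quantities from data.

```python
from collections import Counter, defaultdict


def bigram_counts(tokens):
    """Count how often each word follows each context word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def dirichlet_predictive(counts, prev, word, alpha, m):
    """Predictive P(word | prev) under a Dirichlet(alpha * m) prior.

    alpha and m are illustrative assumptions, fixed by hand here.
    """
    ctx = counts.get(prev, Counter())
    total = sum(ctx.values())
    return (ctx[word] + alpha * m[word]) / (total + alpha)


def interpolated(counts, prev, word, lam, background):
    """Linear interpolation of a background distribution and the bigram ML estimate."""
    ctx = counts.get(prev, Counter())
    total = sum(ctx.values())
    ml = ctx[word] / total if total else 0.0
    return lam * background[word] + (1.0 - lam) * ml


tokens = "a b a b a c".split()          # toy corpus (assumption)
vocab = sorted(set(tokens))
counts = bigram_counts(tokens)
m = {w: 1.0 / len(vocab) for w in vocab}  # uniform mean vector (assumption)
alpha = 2.0                               # hand-picked concentration (assumption)

p_dirichlet = dirichlet_predictive(counts, "a", "b", alpha, m)

# The same number falls out of interpolation when the weight depends on
# the context count: lam = alpha / (total + alpha).
total_a = sum(counts["a"].values())
lam_a = alpha / (total_a + alpha)
p_interp = interpolated(counts, "a", "b", lam_a, m)
```

In this sketch the Dirichlet predictive coincides with an interpolation whose weight `lam` shrinks as the context is observed more often, which is one way to see both the similarity to smoothing and a point where the two can differ: a smoothing scheme with a single fixed weight would not adapt to the context count in this way.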
