Distant bigram language modelling using maximum entropy

We apply the maximum entropy approach to so-called distant bigram language modelling. In addition to the usual unigram and bigram dependencies, we use distant bigram dependencies, in which the immediate predecessor of the word under consideration is skipped. We analyze the computational complexity of the resulting training algorithm, generalized iterative scaling (GIS), and study the details of its implementation. We describe a method for handling unseen events in the maximum entropy approach by discounting the frequencies of observed events, and we study the effect of this discounting on the convergence of the GIS algorithm. We give experimental perplexity results for a corpus from the Wall Street Journal (WSJ) task. By combining the maximum entropy approach with distant bigram dependencies, we reduce the perplexity from 205.4 for our best conventional bigram model to 169.5, a relative reduction of about 17%.
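To make the feature set concrete, the following is a minimal sketch of a conditional maximum entropy model with unigram, bigram, and distant bigram features, trained with a GIS-style update. It is an illustration under stated assumptions, not the implementation described in the abstract: the function names (`features`, `predict`, `train_gis`, `perplexity`) and the toy corpus are hypothetical, only features observed in training are used, the GIS correction feature is omitted (the constant C is taken as the maximum number of active features per event), and no discounting of observed frequencies is applied.

```python
import math
from collections import Counter, defaultdict

def features(u, v, w):
    """Binary features active when predicting w after the history (u, v).
    u is the word two positions back, v is the immediate predecessor."""
    return [("uni", w), ("bi", v, w), ("dist", u, w)]

def predict(lam, vocab, u, v):
    """Return p(w | u, v) for every w in vocab under the log-linear model."""
    scores = {
        w: math.exp(sum(lam.get(f, 0.0) for f in features(u, v, w)))
        for w in vocab
    }
    z = sum(scores.values())  # normalization over the vocabulary
    return {w: s / z for w, s in scores.items()}

def train_gis(tokens, iterations=20):
    """GIS-style training (assumption: no correction feature, no discounting)."""
    vocab = sorted(set(tokens))
    events = [(tokens[i - 2], tokens[i - 1], tokens[i])
              for i in range(2, len(tokens))]
    # Empirical feature counts; only observed features receive a parameter.
    observed = Counter(f for e in events for f in features(*e))
    lam = {f: 0.0 for f in observed}
    C = 3.0  # at most three features are active for any (history, word) pair
    hist_counts = Counter((u, v) for u, v, _ in events)
    for _ in range(iterations):
        # Expected feature counts under the current model.
        expected = defaultdict(float)
        for (u, v), count in hist_counts.items():
            for w, p in predict(lam, vocab, u, v).items():
                for f in features(u, v, w):
                    if f in lam:
                        expected[f] += count * p
        # Scaling update: lambda_i += (1/C) * log(observed_i / expected_i).
        for f in lam:
            lam[f] += math.log(observed[f] / expected[f]) / C
    return lam, vocab

def perplexity(lam, vocab, tokens):
    """Perplexity over positions 2..N-1; every token must be in vocab."""
    log_prob, n = 0.0, 0
    for i in range(2, len(tokens)):
        p = predict(lam, vocab, tokens[i - 2], tokens[i - 1])[tokens[i]]
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Toy corpus (hypothetical), evaluated on the training data itself.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
lam, vocab = train_gis(tokens)
before = perplexity({}, vocab, tokens)  # all parameters zero: uniform model
after = perplexity(lam, vocab, tokens)
```

Because no discounting is applied here, words that never follow a given history keep only whatever probability the unigram and distant bigram features leave them, which is the unseen-event problem the abstract addresses by discounting observed frequencies.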
