A survey of smoothing techniques for ME models

In certain contexts, maximum entropy (ME) modeling can be viewed as maximum likelihood (ML) training for exponential models, and like other ML methods it is prone to overfitting the training data. Several smoothing methods for ME models have been proposed to address this problem, but previous results do not make clear how these smoothing methods compare with smoothing methods for other types of related models. In this work, we survey previous work in ME smoothing and compare the performance of several of these algorithms with conventional techniques for smoothing n-gram language models. Because of the mature body of research in n-gram model smoothing and the close connection between ME and conventional n-gram models, this domain is well suited for gauging the performance of ME smoothing methods. Over a large number of data sets, we find that fuzzy ME smoothing performs as well as or better than all other algorithms under consideration. We contrast this method with previous n-gram smoothing methods to explain its superior performance.
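To illustrate the kind of conventional n-gram smoothing the survey compares against, the following is a minimal sketch of linear (Jelinek-Mercer) interpolation for a bigram model, which mixes the maximum-likelihood bigram estimate with the unigram distribution so that unseen bigrams still receive nonzero probability. The function name and the fixed interpolation weight `lam` are illustrative choices, not from the survey; in practice the weight would be tuned on held-out data.

```python
from collections import Counter

def jelinek_mercer_bigram(tokens, lam=0.7):
    """Bigram model smoothed by linear interpolation:
    p(w | v) = lam * p_ML(w | v) + (1 - lam) * p_ML(w)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def prob(w, v):
        p_uni = unigrams[w] / total
        # ML bigram estimate; zero when the history v was never seen
        p_bi = bigrams[(v, w)] / unigrams[v] if unigrams[v] else 0.0
        return lam * p_bi + (1 - lam) * p_uni

    return prob

tokens = "the cat sat on the mat".split()
p = jelinek_mercer_bigram(tokens)
# the unseen bigram ("mat", "cat") still gets mass from the unigram term
assert p("cat", "mat") > 0
```

Because the unigram term is itself a proper distribution, the interpolated estimate remains normalized over the vocabulary for any history, which is the property that makes this family of smoothers directly comparable to smoothed ME models.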
