An empirical study of smoothing techniques for language modeling

We survey the most widely-used algorithms for smoothing models for language n -gram modeling. We then present an extensive empirical comparison of several of these smoothing techniques, including those described by Jelinek and Mercer (1980); Katz (1987); Bell, Cleary and Witten (1990); Ney, Essen and Kneser (1994), and Kneser and Ney (1995). We investigate how factors such as training data size, training corpus (e.g. Brown vs. Wall Street Journal), count cutoffs, and n -gram order (bigram vs. trigram) affect the relative performance of these methods, which is measured through the cross-entropy of test data. We find that these factors can significantly affect the relative performance of models, with the most significant factor being training data size. Since no previous comparisons have examined these factors systematically, this is the first thorough characterization of the relative performance of various algorithms. In addition, we introduce methodologies for analyzing smoothing algorithm efficacy in detail, and using these techniques we motivate a novel variation of Kneser?Ney smoothing that consistently outperforms all other algorithms evaluated. Finally, results showing that improved language model smoothing leads to improved speech recognition performance are presented.

[1]  W. E. Johnson I.—PROBABILITY: THE DEDUCTIVE AND INDUCTIVE PROBLEMS , 1932 .

[2]  I. Good THE POPULATION FREQUENCIES OF SPECIES AND THE ESTIMATION OF POPULATION PARAMETERS , 1953 .

[3]  H. Kucera,et al.  Computational analysis of present-day American English , 1967 .

[4]  L. Baum,et al.  An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process , 1972 .

[5]  Lalit R. Bahl,et al.  A Maximum Likelihood Approach to Continuous Speech Recognition , 1983, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  A. Nadas,et al.  Estimation of probabilities in the language model of the IBM speech recognition system , 1984 .

[7]  Arthur Nádas,et al.  On Turing's formula for word probabilities , 1985, IEEE Trans. Acoust. Speech Signal Process..

[8]  Slava M. Katz,et al.  Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[9]  Kenneth Ward Church A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text , 1988, Applied Natural Language Processing Conference.

[10]  Lalit R. Bahl,et al.  A tree-based statistical language model for natural language speech recognition , 1989, IEEE Trans. Acoust. Speech Signal Process..

[11]  Kenneth Ward Church,et al.  A Spelling Correction Program Based on a Noisy Channel Model , 1990, COLING.

[12]  John Cocke,et al.  A Statistical Approach to Machine Translation , 1990, CL.

[13]  Ian H. Witten,et al.  Text Compression , 1990, 125 Problems in Text Algorithms.

[14]  Ian H. Witten,et al.  The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression , 1991, IEEE Trans. Inf. Theory.

[15]  Kenneth Ward Church,et al.  A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams , 1991 .

[16]  Hermann Ney,et al.  On smoothing techniques for bigram-based natural language modelling , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[17]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[18]  Robert L. Mercer,et al.  An Estimate of an Upper Bound for the Entropy of English , 1992, CL.

[19]  John J. Godfrey,et al.  SWITCHBOARD: telephone speech corpus for research and development , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[20]  Hermann Ney,et al.  On structuring probabilistic dependences in stochastic language modelling , 1994, Comput. Speech Lang..

[21]  David J. C. MacKay,et al.  A hierarchical Dirichlet language model , 1995, Natural Language Engineering.

[22]  Stanley F. Chen,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.