A study of n-gram and decision tree letter language modeling methods

Abstract: The goal of this paper is to investigate various language model smoothing techniques and decision-tree-based language model design algorithms. For this purpose, we build language models for printable characters (letters) based on the Brown corpus. We consider two classes of models for the text-generation process: the n-gram language model and various decision-tree-based language models. In the first part of the paper, we compare the most popular smoothing algorithms applied to the former. We conclude that the bottom-up deleted interpolation algorithm performs best in the task of n-gram letter language model smoothing, significantly outperforming the back-off smoothing technique for large values of n. In the second part of the paper, we consider various decision tree development algorithms. Among them, a K-means clustering type algorithm for the design of the decision tree questions gives the best results. However, the n-gram language model outperforms the decision tree language models for letter language modeling. We believe that this is due to the predictive nature of letter strings, which seems to be naturally modeled by n-grams.
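To make the smoothing setup concrete, the following is a minimal illustrative sketch (not the authors' implementation) of an interpolated letter trigram model in the Jelinek-Mercer style, with interpolation weights re-estimated by EM on held-out text; the bucketed, bottom-up deleted interpolation studied in the paper refines this idea, and all class and function names below are hypothetical.

# Illustrative sketch: letter trigram model smoothed by simple linear
# interpolation of trigram, bigram, and unigram ML estimates.
# Interpolation weights are fit by EM on held-out text.

from collections import Counter

def ngram_counts(text, n):
    """Counter of letter n-grams occurring in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

class InterpolatedTrigram:
    def __init__(self, train_text):
        self.uni = ngram_counts(train_text, 1)
        self.bi = ngram_counts(train_text, 2)
        self.tri = ngram_counts(train_text, 3)
        self.n_letters = sum(self.uni.values())
        self.lam = [1 / 3, 1 / 3, 1 / 3]  # interpolation weights

    def _component_probs(self, h, c):
        """Raw ML estimates of P(c), P(c | h[-1]), P(c | h) for a 2-letter history h."""
        p1 = self.uni[c] / self.n_letters
        p2 = self.bi[h[-1] + c] / self.uni[h[-1]] if self.uni[h[-1]] else 0.0
        p3 = self.tri[h + c] / self.bi[h] if self.bi[h] else 0.0
        return p1, p2, p3

    def prob(self, h, c):
        """Interpolated estimate of P(c | h)."""
        return sum(l * p for l, p in zip(self.lam, self._component_probs(h, c)))

    def fit_weights(self, heldout_text, iters=20):
        """EM re-estimation of the interpolation weights on held-out data."""
        for _ in range(iters):
            expected = [0.0, 0.0, 0.0]
            for i in range(2, len(heldout_text)):
                h, c = heldout_text[i - 2:i], heldout_text[i]
                parts = [l * p for l, p in zip(self.lam, self._component_probs(h, c))]
                total = sum(parts)
                if total > 0:
                    for k in range(3):
                        expected[k] += parts[k] / total
            norm = sum(expected)
            self.lam = [e / norm for e in expected]

In an experiment of the kind described, such a model would be trained on one portion of the Brown corpus, tuned on a held-out portion, and evaluated by per-letter perplexity on test text.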
