Models of English text

The problem of constructing models of English text is considered. A number of applications of such models, including cryptology, spelling correction, and speech recognition, are reviewed. The best current models of English text have come from research into compression. Compression is not only an important application of such models; the amount of compression achieved also provides a measure of how well a model performs. Three main classes of models are considered: character-based models, word-based models, and models that use auxiliary information in the form of parts of speech. These models are compared in terms of their memory usage and compression performance.
