Scaling High-Order Character Language Models to Gigabytes

We describe the implementation steps required to scale high-order character language models to gigabytes of training data without pruning. Our online models build character-level PAT trie structures on the fly using heavily data-unfolded implementations of mutable daughter maps with a long-integer count interface. Terminal nodes are shared. Character 8-gram training runs at 200,000 characters per second and allows online tuning of hyperparameters. Our compiled models precompute all probability estimates for observed n-grams and all interpolation parameters, along with suffix pointers that reduce context computation from time proportional to n-gram length to constant time. The resulting compiled models are larger than the training models but execute at 2 million characters per second on a desktop PC. Cross-entropy on held-out data shows these models to be state of the art.
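To make the online trie construction concrete, the following is a minimal sketch, assuming a Java-style implementation in which each node holds one mutable daughter map and a 64-bit count. The class and method names (CharTrieNode, OnlineCharLm, train) are illustrative, not taken from the paper, and the data-unfolded node layouts, shared terminal nodes, and compiled-model machinery described above are omitted.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical node: one mutable daughter map plus a long count, in the
    // spirit of the "long integer count interface" mentioned in the abstract.
    final class CharTrieNode {
        long count;                                   // times the path to this node was observed
        final Map<Character, CharTrieNode> daughters = new HashMap<>();

        CharTrieNode daughter(char c) {               // create-on-demand daughter lookup
            return daughters.computeIfAbsent(c, k -> new CharTrieNode());
        }
    }

    // Hypothetical online counter: builds the character trie on the fly,
    // counting every substring of length <= maxNGram (e.g. 8 for 8-grams).
    final class OnlineCharLm {
        private final int maxNGram;
        private final CharTrieNode root = new CharTrieNode();

        OnlineCharLm(int maxNGram) { this.maxNGram = maxNGram; }

        void train(CharSequence text) {
            int len = text.length();
            for (int start = 0; start < len; start++) {
                CharTrieNode node = root;
                root.count++;                         // total character positions seen
                int end = Math.min(len, start + maxNGram);
                for (int i = start; i < end; i++) {
                    node = node.daughter(text.charAt(i));
                    node.count++;                     // count of the substring text[start..i]
                }
            }
        }
    }

In a compiled model as described above, such counts would be replaced by precomputed probability estimates and interpolation parameters, with each node also carrying a suffix pointer so that the next context can be reached in constant time rather than by re-walking the n-gram.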
