Scaling High-Order Character Language Models to Gigabytes

We describe the implementation steps required to scale high-order character language models to gigabytes of training data without pruning. Our online models build character-level PAT trie structures on the fly using heavily data-unfolded implementations of mutable daughter maps with a long-integer count interface. Terminal nodes are shared. Character 8-gram training runs at 200,000 characters per second and allows online tuning of hyperparameters. Our compiled models precompute all probability estimates for observed n-grams and all interpolation parameters, along with suffix pointers that reduce context computation from time proportional to n-gram length to constant time. The resulting compiled models are larger than the training models but execute at 2 million characters per second on a desktop PC. Cross-entropy on held-out data shows these models to be state of the art.
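To make the online trie construction concrete, the following is a minimal sketch, assuming a Java-style implementation in which each node holds one mutable daughter map and a 64-bit count. The class and method names (CharTrieNode, OnlineCharLm, train) are illustrative, not taken from the paper, and the data-unfolded node layouts, shared terminal nodes, and compiled-model machinery described above are omitted.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical node: one mutable daughter map plus a long count, in the
    // spirit of the "long integer count interface" mentioned in the abstract.
    final class CharTrieNode {
        long count;                                   // times the path to this node was observed
        final Map<Character, CharTrieNode> daughters = new HashMap<>();

        CharTrieNode daughter(char c) {               // create-on-demand daughter lookup
            return daughters.computeIfAbsent(c, k -> new CharTrieNode());
        }
    }

    // Hypothetical online counter: builds the character trie on the fly,
    // counting every substring of length <= maxNGram (e.g. 8 for 8-grams).
    final class OnlineCharLm {
        private final int maxNGram;
        private final CharTrieNode root = new CharTrieNode();

        OnlineCharLm(int maxNGram) { this.maxNGram = maxNGram; }

        void train(CharSequence text) {
            int len = text.length();
            for (int start = 0; start < len; start++) {
                CharTrieNode node = root;
                root.count++;                         // total character positions seen
                int end = Math.min(len, start + maxNGram);
                for (int i = start; i < end; i++) {
                    node = node.daughter(text.charAt(i));
                    node.count++;                     // count of the substring text[start..i]
                }
            }
        }
    }

In a compiled model as described above, such counts would be replaced by precomputed probability estimates and interpolation parameters, with each node also carrying a suffix pointer so that the next context can be reached in constant time rather than by re-walking the n-gram.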
