KenLM: Faster and Smaller Language Model Queries

We present KenLM, a library that implements two data structures for efficient language model queries, reducing both time and memory costs. The Probing data structure uses linear probing hash tables and is designed for speed. Compared with the widely-used SRILM, our Probing model is 2.4 times as fast while using 57% of the memory. The Trie data structure is a trie with bit-level packing, sorted records, interpolation search, and optional quantization aimed at lower memory consumption. Trie simultaneously uses less memory than the smallest lossless baseline and less CPU than the fastest baseline. Our code is open-source, thread-safe, and integrated into the Moses, cdec, and Joshua translation systems. This paper describes the performance techniques used and presents benchmarks against alternative implementations.
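To make the Trie lookup strategy concrete, the following is a minimal sketch (not taken from the KenLM source) of interpolation search over a sorted array of 64-bit keys, the technique the abstract names for locating sorted records; the array, function name, and test values are illustrative assumptions.

```cpp
// Minimal sketch, not KenLM's actual code: interpolation search over a
// sorted array of 64-bit keys. The position estimate assumes keys are
// roughly uniformly distributed, giving O(log log N) expected probes.
#include <cstdint>
#include <cstddef>
#include <iostream>

// Returns the index of `key` in the sorted array `a` of length `n`, or -1 if absent.
std::ptrdiff_t InterpolationSearch(const uint64_t *a, std::size_t n, uint64_t key) {
  std::size_t lo = 0, hi = n ? n - 1 : 0;
  while (n && lo <= hi && key >= a[lo] && key <= a[hi]) {
    // Guess the position by assuming keys are spread evenly between a[lo] and a[hi].
    std::size_t span = hi - lo;
    std::size_t mid = lo;
    if (a[hi] != a[lo]) {
      mid = lo + static_cast<std::size_t>(
          (static_cast<long double>(key - a[lo]) / (a[hi] - a[lo])) * span);
    }
    if (a[mid] == key) return static_cast<std::ptrdiff_t>(mid);
    if (a[mid] < key) {
      lo = mid + 1;
    } else {
      if (mid == 0) break;  // avoid unsigned underflow
      hi = mid - 1;
    }
  }
  return -1;
}

int main() {
  const uint64_t keys[] = {2, 11, 23, 42, 57, 71, 88, 99};
  std::cout << InterpolationSearch(keys, 8, 57) << "\n";  // prints 4
  std::cout << InterpolationSearch(keys, 8, 58) << "\n";  // prints -1 (not present)
}
```

Compared with binary search's fixed midpoint, the estimated midpoint converges faster when keys are near-uniform, which is why the paper pairs it with sorted, packed records.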
