Faster and Smaller N-Gram Language Models

N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.
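For scale, the compression figure quoted above works out to roughly 4 billion n-grams × 23 bits ≈ 9.2 × 10^10 bits, or about 11.5 GB of storage. As a rough illustration of the caching idea the abstract mentions, the sketch below wraps an arbitrary backing language model in a small hash-based query cache: during decoding the same n-grams are scored over and over, so remembering recent answers avoids redundant lookups. This is a minimal, hypothetical sketch, not the authors' implementation; the names CachedLM, backing_lm, and score are assumptions for illustration only.

    # Hypothetical illustration of an n-gram query cache; names are assumed, not from the paper.
    class CachedLM:
        def __init__(self, backing_lm, capacity=1 << 20):
            self.backing_lm = backing_lm   # any object exposing score(ngram) -> float
            self.capacity = capacity       # bound the cache so memory use stays fixed
            self.cache = {}                # maps an n-gram tuple to its cached log-probability

        def score(self, ngram):
            ngram = tuple(ngram)
            hit = self.cache.get(ngram)
            if hit is not None:
                return hit                 # repeated queries during decoding are served here
            prob = self.backing_lm.score(ngram)
            if len(self.cache) >= self.capacity:
                self.cache.clear()         # crude eviction: drop everything and refill
            self.cache[ngram] = prob
            return prob

Whether such a wrapper pays off depends on how often the decoder repeats queries; the abstract reports that this kind of caching improves query speed by up to 300% for their models and for SRILM.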
