Scalable Modified Kneser-Ney Language Model Estimation

We present an efficient algorithm to estimate large modified Kneser-Ney models, including interpolation. Streaming and sorting enable the algorithm to scale to much larger models, using a fixed amount of RAM and a variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments with this model show an improvement of 0.8 BLEU points over constrained systems for the 2013 Workshop on Machine Translation task in three language pairs. Our algorithm is also faster for small models: we estimated a model on 302 million tokens using 7.7% of the RAM and 14.0% of the wall time taken by SRILM. The code is open source as part of KenLM.
