N-gram Counts and Language Models from the Common Crawl

We contribute 5-gram counts and language models trained on the Common Crawl corpus, a collection of over 9 billion web pages. This release improves upon the Google n-gram counts in two key ways: the inclusion of low-count entries and deduplication to reduce boilerplate. By preserving singletons, we were able to use Kneser-Ney smoothing to build large language models. This paper describes how the corpus was processed, with emphasis on the problems that arise when working with data at this scale. Our unpruned Kneser-Ney English 5-gram language model, built on 975 billion deduplicated tokens, contains over 500 billion unique n-grams. We show gains of 0.5-1.4 BLEU by using large language models to translate into various languages.
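The two preprocessing ideas mentioned above, deduplicating repeated lines to reduce boilerplate and counting n-grams without discarding singletons, can be illustrated with a minimal sketch. The Python fragment below is only an illustration of the concepts and is not the pipeline used for the release; the corpus, tokenization, and deduplication strategy here are placeholder assumptions.

```python
from collections import Counter


def deduplicate_lines(lines):
    """Keep only the first occurrence of each line, a crude stand-in
    for boilerplate removal by line-level deduplication."""
    seen = set()
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            yield key


def count_ngrams(lines, order=5):
    """Count all n-grams up to the given order, keeping singletons so
    that unpruned Kneser-Ney smoothing can later be applied."""
    counts = Counter()
    for line in lines:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    # Toy corpus with a duplicated line standing in for boilerplate.
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",
        "a different sentence entirely",
    ]
    counts = count_ngrams(deduplicate_lines(corpus))
    print(len(counts), "unique n-grams, singletons included")
```

At web scale these counts would of course be streamed and sharded to disk rather than held in a single in-memory counter; the sketch only makes the counting and deduplication steps concrete.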
