N-gram Counts and Language Models from the Common Crawl

We contribute 5-gram counts and language models trained on the Common Crawl corpus, a collection of over 9 billion web pages. This release improves upon the Google n-gram counts in two key ways: the inclusion of low-count entries and deduplication to reduce boilerplate. By preserving singletons, we were able to use Kneser-Ney smoothing to build large language models. This paper describes how the corpus was processed, with emphasis on the problems that arise when working with data at this scale. Our unpruned Kneser-Ney English 5-gram language model, built on 975 billion deduplicated tokens, contains over 500 billion unique n-grams. We show gains of 0.5-1.4 BLEU by using large language models to translate into various languages.
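The two preprocessing ideas mentioned above, deduplicating repeated lines to reduce boilerplate and counting n-grams without discarding singletons, can be illustrated with a minimal sketch. The Python fragment below is only an illustration of the concepts and is not the pipeline used for the release; the corpus, tokenization, and deduplication strategy here are placeholder assumptions.

```python
from collections import Counter


def deduplicate_lines(lines):
    """Keep only the first occurrence of each line, a crude stand-in
    for boilerplate removal by line-level deduplication."""
    seen = set()
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            yield key


def count_ngrams(lines, order=5):
    """Count all n-grams up to the given order, keeping singletons so
    that unpruned Kneser-Ney smoothing can later be applied."""
    counts = Counter()
    for line in lines:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    # Toy corpus with a duplicated line standing in for boilerplate.
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",
        "a different sentence entirely",
    ]
    counts = count_ngrams(deduplicate_lines(corpus))
    print(len(counts), "unique n-grams, singletons included")
```

At web scale these counts would of course be streamed and sharded to disk rather than held in a single in-memory counter; the sketch only makes the counting and deduplication steps concrete.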
