CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data

In recent years, the field of document understanding has advanced considerably. A significant part of this progress has been possible thanks to language models pretrained on large collections of documents. However, the pretraining corpora used in document understanding are typically single-domain, monolingual, or non-public. Our goal in this paper is to propose an efficient pipeline for creating a large-scale, diverse, multilingual corpus of PDF files from all over the Internet using Common Crawl, since PDF files are the most canonical type of document considered in document understanding. We extensively analysed all steps of the pipeline and proposed a solution that is a trade-off between data quality and processing time. We also share the CCpdf corpus, in the form of an index of PDF files along with a script for downloading them, which yields a collection suitable for language model pretraining. The dataset and tools published with this paper give researchers the opportunity to develop even better multilingual language models.
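Common Crawl's public URL index records the MIME type of every capture, so PDF files can be located without scanning entire crawl archives. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's released script: it queries one index snapshot for application/pdf captures and fetches a single PDF via an HTTP range request against the public data.commoncrawl.org bucket. The index API and range-request scheme are documented Common Crawl conventions; the snapshot name, URL pattern, and helper names are illustrative assumptions.

```python
import gzip
import json

import requests

# Arbitrary crawl snapshot chosen for illustration; any CC-MAIN index works.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2021-31-index"


def find_pdf_captures(url_pattern, limit=5):
    """Return index records whose recorded MIME type is application/pdf."""
    params = {
        "url": url_pattern,
        "output": "json",            # one JSON record per line
        "filter": "mime:application/pdf",
        "limit": limit,
    }
    resp = requests.get(INDEX, params=params, timeout=60)
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines()]


def fetch_pdf(record):
    """Download one capture from its WARC file via an HTTP range request."""
    offset = int(record["offset"])
    length = int(record["length"])
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    warc_url = "https://data.commoncrawl.org/" + record["filename"]
    resp = requests.get(warc_url, headers=headers, timeout=60)
    resp.raise_for_status()
    # The slice is one gzipped WARC record: WARC headers, then HTTP headers,
    # then the response body (the PDF bytes).
    payload = gzip.decompress(resp.content)
    return payload.split(b"\r\n\r\n", 2)[-1]


if __name__ == "__main__":
    for rec in find_pdf_captures("*.gov/*.pdf"):
        pdf_bytes = fetch_pdf(rec)
        print(rec["url"], len(pdf_bytes), "bytes")
```

A production pipeline would add the steps the paper analyses on top of this skeleton, such as deduplicating URLs across snapshots, filtering broken or undesirable files, and balancing languages.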
