CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data
[1] O. Togao, et al. The 51st Annual Meeting of The Japanese Society of Neuroradiology, 18–19 February 2022, 2022, Neuroradiology.
[2] Ali Furkan Biten, et al. OCR-IDL: OCR Annotations for Industry Document Library Dataset, 2022, ECCV Workshops.
[3] Laurent Romary, et al. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus, 2022, LREC.
[4] Xilun Chen, et al. CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training, 2021, NAACL-HLT.
[5] Alexandra Luccioni, et al. What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus, 2021, ACL.
[6] Chen Liang, et al. Carbon Emissions and Large Neural Network Training, 2021, ArXiv.
[7] Jesse Dodge, et al. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus, 2021, EMNLP.
[8] Furu Wei, et al. LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, 2021, ArXiv.
[9] Tomasz Dwojak, et al. Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer, 2021, ICDAR.
[10] Charles Foster, et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling, 2020, ArXiv.
[11] Furu Wei, et al. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding, 2020, ACL.
[12] Colin Raffel, et al. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer, 2020, NAACL.
[13] Furu Wei, et al. DocBank: A Benchmark Dataset for Document Layout Analysis, 2020, COLING.
[14] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[15] Patrick Paroubek, et al. NLP Analytics in Finance with DoRe: A French 250M Tokens Corpus of Corporate Annual Reports, 2020, LREC.
[16] Chris Mattmann, et al. Research Report: Building a Wide Reach Corpus for Secure Parser Development, 2020, IEEE Security and Privacy Workshops (SPW).
[17] Doug Downey, et al. Don't Stop Pretraining: Adapt Language Models to Domains and Tasks, 2020, ACL.
[18] Lukasz Garncarek, et al. LAMBERT: Layout-Aware Language Modeling for Information Extraction, 2020, ICDAR.
[19] Lin Su, et al. ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data, 2020, ArXiv.
[20] Furu Wei, et al. LayoutLM: Pre-training of Text and Layout for Document Image Understanding, 2019, KDD.
[21] Holger Schwenk, et al. CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web, 2019, ACL.
[22] Vishrav Chaudhary, et al. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, 2019, LREC.
[23] Peter J. Liu, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[24] Ming-Wei Chang, et al. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models, 2019.
[25] Antonio Jimeno-Yepes, et al. PubLayNet: Largest Dataset Ever for Document Layout Analysis, 2019, International Conference on Document Analysis and Recognition (ICDAR).
[26] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[27] Doug Downey, et al. Construction of the Literature Graph in Semantic Scholar, 2018, NAACL.
[28] Iryna Gurevych, et al. C4Corpus: Multilingual Web-size Corpus with Free License, 2016, LREC.
[29] Konstantinos G. Derpanis, et al. Evaluation of deep convolutional nets for document image classification and retrieval, 2015, International Conference on Document Analysis and Recognition (ICDAR).
[30] Philipp Koehn, et al. Dirt Cheap Web-Scale Parallel Text from the Common Crawl, 2013, ACL.
[31] Shlomo Argamon, et al. Building a test collection for complex document information processing, 2006, SIGIR.
[32] James R. Curran, et al. Web Text Corpus for Natural Language Processing, 2006, EACL.
[33] M. Turski, et al. DUE: End-to-End Document Understanding Benchmark, 2021, NeurIPS Datasets and Benchmarks.
[34] Adam Kilgarriff, et al. of the European Chapter of the Association for Computational Linguistics, 2006.