The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.

David Ifeoluwa Adelani | Teven Le Scao | Leandro von Werra | Stella Rose Biderman | Javier de la Rosa | Pedro Ortiz Suarez | Albert Villanova del Moral | Kyle Lo | Yacine Jernite | Margaret Mitchell | Aitor Soroa Etxabe | Itziar Gonzalez-Dios | Anna Rogers | Tristan Thrush | Aaron Gokaslan | Jian Zhu | S. Longpre | Olivier Nguyen | Zaid Alyafeai | Manan Dey | Thomas Wang | Leon Weber | Sasha Luccioni | Pierre Colombo | Jenny Chim | Jörg Frohberg | Huu Nguyen | Maraim Masoud | Gérard Dupont | Somaieh Nikpoor | Christopher Akiki | F. Toni | Daniel Alexander van Strien | Shamik Bose | Hugo Laurençon | Paulo Villegas | Quentin Lhoest | Lucile Saulnier | Long Phan | Angelina McMillan-Major | Chenghao Mou | Giada Pistilli | Khalid Almubarak | Mario Šaško | Minh Chien Vu | Sebastian Nagel | S. Pai | Violette Lepercq | Loubna Ben Allal | I. Yu | H. Tran | E. G. Ponferrada | M. Muñoz | Suzana Ilic
