The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

As language models grow ever larger, the need for large-scale, high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a one-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.
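Since the abstract mentions that an initial subset of the corpus is released for reuse, the snippet below is a minimal sketch of how one might load such a subset with the Hugging Face `datasets` library. The subset identifier is an assumption for illustration only, not a confirmed dataset name; the actual subset names (and any access or gating conditions) should be checked on the Hugging Face Hub under the BigScience data organization.

```python
from datasets import load_dataset

# Stream rather than download, since individual ROOTS subsets can be large.
subset = load_dataset(
    "bigscience-data/roots_en_wikipedia",  # assumed identifier, for illustration
    split="train",
    streaming=True,
)

# Peek at the first few records without assuming a particular schema.
for i, record in enumerate(subset):
    preview = {key: str(value)[:80] for key, value in record.items()}
    print(preview)
    if i == 2:
        break
```

Streaming keeps memory and disk usage bounded, which matters at the scale described in the paper; a full download would only be warranted for training-scale use.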
