The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
David Ifeoluwa Adelani | Teven Le Scao | Leandro von Werra | Stella Rose Biderman | Javier de la Rosa | Pedro Ortiz Suarez | Albert Villanova del Moral | Kyle Lo | Yacine Jernite | Margaret Mitchell | Aitor Soroa Etxabe | Itziar Gonzalez-Dios | Anna Rogers | Tristan Thrush | Aaron Gokaslan | Jian Zhu | S. Longpre | Olivier Nguyen | Zaid Alyafeai | Manan Dey | Thomas Wang | Leon Weber | Sasha Luccioni | Pierre Colombo | Jenny Chim | Jörg Frohberg | Huu Nguyen | Maraim Masoud | Gérard Dupont | Somaieh Nikpoor | Christopher Akiki | F. Toni | Daniel Alexander van Strien | Shamik Bose | Hugo Laurençon | Paulo Villegas | Quentin Lhoest | Lucile Saulnier | Long Phan | Angelina McMillan-Major | Chenghao Mou | Giada Pistilli | Khalid Almubarak | Mario Šaško | Minh Chien Vu | Sebastian Nagel | S. Pai | Violette Lepercq | Loubna Ben Allal | I. Yu | H. Tran | E. G. Ponferrada | M. Muñoz | Suzana Ilic