Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

In recent years, large-scale data collection efforts have prioritized the volume of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has raised concerns about the rights of data subjects represented in these collections, particularly because insufficient documentation and analysis tools make the collections difficult to interrogate. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and the lessons we learned in this endeavor.
