Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
Hady Elsahar | Alham Fikri Aji | Daniel van Strien | Chris C. Emezue | Zaid Alyafeai | Nurulaqilla Khamis | Yacine Jernite | Stella Biderman | Colin Leong | Angelina McMillan-Major | Maraim Masoud | Aitor Soroa | Pedro Ortiz Suarez | Zeerak Talat | Kimbo Chen | Francesco De Toni | Gérard Dupont | Suzana Ilić
[1] Anne Oeldorf-Hirsch, et al. The Biggest Lie on the Internet: Ignoring the Privacy Policies and Terms of Service Policies of Social Networking Services, 2020.
[2] Po-Sen Huang, et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher, 2021, ArXiv.
[3] Timnit Gebru, et al. Lessons from archives: strategies for collecting sociocultural data in machine learning, 2019, FAT*.
[4] Timnit Gebru, et al. Datasheets for datasets, 2018, Commun. ACM.
[5] Diyi Yang, et al. The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics, 2021, GEM.
[6] Jesse Dodge, et al. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus, 2021, EMNLP.
[7] Sarah L. Nesbeitt. Ethnologue: Languages of the World, 1999.
[8] Alexander M. Rush, et al. Datasets: A Community Library for Natural Language Processing, 2021, EMNLP.
[9] Alexandra Luccioni, et al. What’s in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus, 2021, ACL.
[10] Ahmed Hosny, et al. The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards, 2018, Data Protection and Privacy.
[11] Ankur Bapna, et al. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets, 2021, TACL.
[12] Mustafa Ghaleb, et al. Masader: Metadata Sourcing for Arabic Text and Speech Data Resources, 2021, LREC.
[13] Jack Bandy, et al. Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus, 2021, ArXiv.
[14] Ewan Klein, et al. Natural Language Processing with Python, 2009.
[15] Melissa Terras, et al. Crowdsourcing Bentham: Beyond the Traditional Boundaries of Academic History, 2014, Int. J. Humanit. Arts Comput.
[16] Ya'akov Gal, et al. Intervention Strategies for Increasing Engagement in Crowdsourcing: Platform, Predictions, and Experiments, 2016, IJCAI.
[17] Praveen K. Paritosh, et al. “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI, 2021, CHI.
[18] Tanya Y. Berger-Wolf, et al. Wildbook: Crowdsourcing, computer vision, and data science for conservation, 2017, ArXiv.
[19] Vinay Uday Prabhu, et al. Large image datasets: A pyrrhic win for computer vision?, 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).
[20] Zhe Gan, et al. Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models, 2021, NeurIPS Datasets and Benchmarks.
[21] Mark Liberman, et al. A Progress Report on Activities at the Linguistic Data Consortium Benefitting the LREC Community, 2020, LREC.
[22] Vinay Uday Prabhu, et al. Multimodal datasets: misogyny, pornography, and malignant stereotypes, 2021, ArXiv.
[23] Amandalynne Paullada, et al. Data and its (dis)contents: A survey of dataset development and use in machine learning research, 2020, Patterns.
[24] Emily M. Bender, et al. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science, 2018, TACL.
[25] Charles Foster, et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling, 2020, ArXiv.
[26] C. Lintott, et al. Galaxy Zoo: Motivations of Citizen Scientists, 2008, 1303.6886.