Addressing "Documentation Debt" in Machine Learning: A Retrospective Datasheet for BookCorpus
[1] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, IEEE International Conference on Computer Vision (ICCV).
[3] Adam Tauman Kalai, et al. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings, 2016, NIPS.
[4] Arvind Narayanan, et al. Semantics derived automatically from language corpora contain human-like biases, 2016, Science.
[5] Neoklis Polyzotis, et al. Data Lifecycle Challenges in Production Machine Learning, 2018, SIGMOD Rec.
[6] Jason Baldridge, et al. Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns, 2018, TACL.
[7] Emily M. Bender, et al. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science, 2018, TACL.
[8] Ahmed Hosny, et al. The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards, 2018, Data Protection and Privacy.
[9] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, arXiv.
[10] Abolfazl Asudeh, et al. MithraLabel: Flexible Dataset Nutritional Labels for Responsible Data Science, 2019, CIKM.
[11] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[12] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.
[13] Michael S. Ryoo, et al. AViD Dataset: Anonymized Videos from Diverse Countries, 2020, NeurIPS.
[14] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[15] Jacob Devlin, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[16] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[17] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.
[18] Andrei Paleyes, et al. Challenges in Deploying Machine Learning: A Survey of Case Studies, 2020, arXiv.
[19] Data Protection and Privacy: Data Protection and Democracy, 2020.
[20] A. Hosny, et al. The Dataset Nutrition Label, 2020, Data Protection and Privacy.
[21] Mark A. Lemley, et al. Fair Learning, 2020, SSRN Electronic Journal.
[22] Xiaohua Zhai, et al. Are we done with ImageNet?, 2020, arXiv.
[23] Timnit Gebru, et al. Datasheets for datasets, 2018, Commun. ACM.
[24] Emily Denton, et al. Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure, 2020, FAccT.
[25] Maarten Sap, et al. Documenting the English Colossal Clean Crawled Corpus, 2021, arXiv.
[26] Emily M. Bender, et al. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜, 2021, FAccT.
[27] Jonas Mueller, et al. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks, 2021, NeurIPS Datasets and Benchmarks.
[28] Amandalynne Paullada, et al. Data and its (dis)contents: A survey of dataset development and use in machine learning research, 2020, Patterns.
[29] Kai-Wei Chang, et al. BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation, 2021, FAccT.
[30] Charles Foster, et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling, 2020, arXiv.
[31] Ankur Bapna, et al. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets, 2021, TACL.