Addressing "Documentation Debt" in Machine Learning: A Retrospective Datasheet for BookCorpus

This paper contributes a formal case study in retrospective dataset documentation and pinpoints several problems with the influential BookCorpus dataset. Recent work has underscored the importance of dataset documentation in machine learning research, including efforts to address the “documentation debt” of datasets that have been used widely but documented sparsely. BookCorpus is one such dataset: researchers have used it to train OpenAI’s GPT-N models and Google’s BERT models, yet little to no documentation exists about its motivation, composition, or collection process. We offer a retrospective datasheet with key context and information about BookCorpus, including several notable deficiencies. In particular, we find evidence that (1) BookCorpus violates copyright restrictions for many books, (2) BookCorpus contains thousands of duplicated books, and (3) BookCorpus exhibits significant skews in genre representation. We also find hints of other potential deficiencies that call for future research, such as lopsided author contributions. While more work remains, this initial effort to provide a datasheet for BookCorpus offers a cautionary case study and adds to a growing literature urging more careful, systematic documentation of machine learning datasets.
