Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure

Datasets that power machine learning are often used, shared, and reused with little visibility into the processes of deliberation that led to their creation. As artificial intelligence systems are increasingly used in high-stakes tasks, system development and deployment practices must be adapted to address the very real consequences of how model development data is constructed and used in practice. This includes greater transparency about data and accountability for the decisions made when developing it. In this paper, we introduce a rigorous framework for dataset development transparency that supports decision-making and accountability. The framework uses the cyclical, infrastructural, and engineering nature of dataset development to draw on best practices from the software development lifecycle. Each stage of the data development lifecycle yields documents that facilitate improved communication and decision-making and that draw attention to the value and necessity of careful data work. The proposed framework makes visible the often overlooked work and decisions that go into dataset creation, a critical step in closing the accountability gap in artificial intelligence and a necessary resource aligned with recent work on auditing processes.
