Documenting Computer Vision Datasets: An Invitation to Reflexive Data Practices

In industrial computer vision, discretionary decisions surrounding the production of image training data remain widely undocumented. Recent research taking issue with such opacity has proposed standardized processes for dataset documentation. In this paper, we expand this space of inquiry through fieldwork at two data processing companies and thirty interviews with data workers and computer vision practitioners. We identify four key issues that hinder the documentation of image datasets and the effective retrieval of production contexts. Finally, we propose reflexivity, understood as a collective consideration of social and intellectual factors that lead to praxis, as a necessary precondition for documentation. Reflexive documentation can help to expose the contexts, relations, routines, and power structures that shape data.
