Problem Formulation and Fairness

Formulating data science problems is an uncertain and difficult process. It requires various forms of discretionary work to translate high-level objectives or strategic goals into tractable problems, necessitating, among other things, the identification of appropriate target variables and proxies. While these choices are rarely self-evident, normative assessments of data science projects often take them for granted, even though different translations can raise profoundly different ethical concerns. Whether we consider a data science project fair often has as much to do with the formulation of the problem as with any property of the resulting model. Building on six months of ethnographic fieldwork with a corporate data science team, and channeling ideas from sociology and history of science, critical data studies, and early writing on knowledge discovery in databases, we describe the complex set of actors and activities involved in problem formulation. Our research demonstrates that the specification and operationalization of the problem are always negotiated and elastic, and rarely worked out with explicit normative considerations in mind. In so doing, we show that careful accounts of everyday data science work can help us better understand how and why data science problems are posed in certain ways, and why specific formulations prevail in practice, even in the face of what might seem like normatively preferable alternatives. We conclude by discussing the implications of our findings, arguing that effective normative interventions will require attending to the practical work of problem formulation.
