A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks

Causal inference from observational data is the goal of many data analyses in the health and social sciences. However, academic statistics has often frowned upon data analyses with a causal objective. The introduction of the term "data science" provides a historic opportunity to redefine data analysis in such a way that it naturally accommodates causal inference from observational data. Like others before us, we organize the scientific contributions of data science into three classes of tasks: description, prediction, and counterfactual prediction (which includes causal inference). An explicit classification of data science tasks is necessary to discuss the data, assumptions, and analytics required to accomplish each task successfully. We argue that a failure to adequately describe the role of subject-matter expert knowledge in data analysis is a source of widespread misunderstandings about data science. Specifically, causal analyses typically require not only good data and algorithms, but also domain expert knowledge. We discuss the implications both for the use of data science to guide decision-making in the real world and for the training of data scientists.
