What failure to predict life outcomes can teach us

Social scientists are increasingly turning to supervised machine learning (SML), a set of methods optimized for using inputs from data to forecast an unobserved outcome, to offer predictions to aid policy (1). Recent work scrutinizes this approach for its suitability to social science questions (2, 3) as well as its potential for perpetuating social inequalities (4). In PNAS, Salganik et al. (5) take a step back and ask a more fundamental question: are individual behaviors and outcomes even predictable?

Prediction is not a typical goal in the social sciences despite recent arguments that it should be (6). Social scientists focus on inference: that is, understanding how an outcome is related to some input. The researcher selects a few inputs, specifies a parametric (often linear) model to connect inputs to the outcome, and estimates the parameters from data. The result is a simple and interpretable model that performs well in the sample at hand. In SML, by contrast, the researcher includes many inputs, considers flexible (often nonparametric) models linking inputs to the outcome, and picks the model that best predicts the outcome in new data. The result is a complex model that might perform well out of sample but often offers little insight into the mechanism linking inputs to the outcome.

Recent work connects these two cultures in different ways (7). First, researchers identify prediction tasks within the classical statistical framework and use SML to improve inference (8, 9). Second, scholars use predictions as a starting point to develop new theory (10). As SML becomes more mainstream in the social sciences and …

Email: fgarip@cornell.edu.

[1] Leif D. Nelson, et al. Data from Paper “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”, 2014.

[2] G. King, et al. Improving Quantitative Studies of International Conflict: A Conjecture, 2000, American Political Science Review.

[3] Leo Breiman, et al. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author), 2001.

[4] Filiz Garip, et al. Machine Learning for Sociology, 2019, Annual Review of Sociology.

[5] T. Yarkoni, et al. Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning, 2017, Perspectives on Psychological Science: A Journal of the Association for Psychological Science.

[6] Antje Kirchner, et al. Measuring the predictability of life outcomes with a scientific mass collaboration, 2020, Proceedings of the National Academy of Sciences.

[7] Douglas B. Downey, et al. Black/White Differences in School Performance: The Oppositional Culture Explanation, 2008.

[8] Michael Luca, et al. Crowdsourcing City Government: Using Tournaments to Improve Inspection Accuracy, 2016.

[9] Andrew D. Selbst, et al. Big Data's Disparate Impact, 2016.

[10] J. Kleinberg, et al. Prediction Policy Problems, 2015, The American Economic Review.

[11] Jeremy Freese, et al. Replication Standards for Quantitative Social Science, 2007.

[12] Justin Grimmer, et al. Estimating Heterogeneous Treatment Effects and the Effects of Heterogeneous Treatments with Ensemble Methods, 2017, Political Analysis.

[13] David Card. The Causal Effect of Education on Earnings, 1999.

[14] D. Watts. Common Sense and Sociological Explanations, 2014, American Journal of Sociology.

[15] Sendhil Mullainathan, et al. Machine Learning: An Applied Econometric Approach, 2017, Journal of Economic Perspectives.

[16] D. Donoho. 50 Years of Data Science, 2017.

[17] Susan Athey, et al. Recursive partitioning for heterogeneous causal effects, 2015, Proceedings of the National Academy of Sciences.