The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables

Evaluating whether machines improve on human performance is one of the central questions of machine learning. However, in many domains the data is selectively labeled: the observed outcomes are themselves a consequence of the existing choices of human decision-makers. For instance, in judicial bail decisions, we observe whether a defendant fails to return for their court appearance only if the human judge decides to release the defendant on bail. This selective labeling makes it harder to evaluate predictive models, because the instances for which outcomes are observed do not represent a random sample of the population. Here we propose a novel framework for evaluating the performance of predictive models on selectively labeled data. We develop an approach called contraction, which allows us to compare the performance of predictive models and human decision-makers without resorting to counterfactual inference. Our methodology harnesses the heterogeneity of human decision-makers and facilitates effective evaluation of predictive models even in the presence of unmeasured confounders (unobservables) that influence both human decisions and the resulting outcomes. Experimental results on real-world datasets spanning diverse domains such as health care, insurance, and criminal justice demonstrate the utility of our evaluation metric in comparing human decisions and machine predictions.
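
The abstract names contraction but does not spell out the procedure. As a minimal sketch of one way the idea can be read, assume that cases are assigned to decision-makers in a comparable way, take the most lenient decision-maker (the one with the highest release rate), and let the model detain additional released cases in order of predicted risk until a target release rate is reached. Every case that remains released has an observed outcome, so a failure rate can be computed without imputing unobserved labels. The names `Case`, `contraction_failure_rate`, and all field names below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Case:
    judge: str              # hypothetical decision-maker identifier
    released: bool          # human decision; outcome is observed only if True
    failed: Optional[bool]  # observed outcome, None when the case was not released
    risk: float             # model's predicted probability of failure


def contraction_failure_rate(cases: List[Case], target_rate: float) -> float:
    """Estimate the model's failure rate at a given release rate (a sketch)."""
    # Group cases by decision-maker.
    by_judge: Dict[str, List[Case]] = {}
    for case in cases:
        by_judge.setdefault(case.judge, []).append(case)

    # Pick the most lenient decision-maker (highest release rate).
    def release_rate(group: List[Case]) -> float:
        return sum(c.released for c in group) / len(group)

    lenient = max(by_judge.values(), key=release_rate)
    released = [c for c in lenient if c.released]

    # Number of cases the model may keep released at the target rate.
    n_keep = int(target_rate * len(lenient))
    if n_keep > len(released):
        raise ValueError("target_rate exceeds the most lenient release rate")

    # The model detains the highest-risk released cases first,
    # so the lowest-risk released cases are the ones kept.
    kept = sorted(released, key=lambda c: c.risk)[:n_keep]

    # Failures among kept cases, as a fraction of the whole caseload.
    return sum(bool(c.failed) for c in kept) / len(lenient)
```

In this sketch the target release rate can only be at or below the most lenient decision-maker's own rate, because the model can remove cases from the labeled pool but cannot add cases whose outcomes were never observed. The resulting number can then be set against the observed failure rate of a human decision-maker who releases at roughly the same rate.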
