Counterfactual Prediction Under Outcome Measurement Error

Across domains such as medicine, employment, and criminal justice, predictive models often target labels that imperfectly reflect the outcomes of interest to experts and policymakers. For example, clinical risk assessments deployed to inform physician decision-making often predict measures of healthcare utilization (e.g., costs, hospitalization) as a proxy for patient medical need. These proxies can be subject to outcome measurement error when they systematically differ from the target outcome they are intended to measure. However, prior modeling efforts to characterize and mitigate outcome measurement error overlook the fact that the decision being informed by a model often serves as a risk-mitigating intervention that affects both the target outcome of interest and its recorded proxy. Thus, in these settings, addressing measurement error requires counterfactual modeling of treatment effects on outcomes. In this work, we study intersectional threats to model reliability introduced by outcome measurement error, treatment effects, and selection bias from historical decision-making policies. We develop an unbiased risk minimization method that, given knowledge of proxy measurement error properties, corrects for the combined effects of these challenges. We also develop a method for estimating treatment-dependent measurement error parameters when these are unknown in advance. We demonstrate the utility of our approach theoretically and via experiments on real-world data from randomized controlled trials conducted in healthcare and employment domains. Equally important, we demonstrate that models correcting for outcome measurement error or treatment effects alone suffer from considerable reliability limitations. Our work underscores the importance of considering intersectional threats to model validity during the design and evaluation of predictive models for decision support.
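To make the idea of unbiased risk minimization under known measurement error more concrete, the following is a minimal sketch and not the method developed in this work. It assumes a binary proxy label with treatment-specific class-conditional error rates (a false positive rate `alpha_t` and a false negative rate `beta_t`), applies the standard loss correction from the label-noise literature, and reweights by an inverse propensity score to account for selection into treatment. All function names, parameter names, and the choice of log loss are illustrative assumptions.

```python
import math


def log_loss(p, y):
    """Standard log loss for a predicted probability p and a binary label y."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # clip to avoid log(0)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def corrected_loss(p, y_proxy, alpha, beta):
    """Noise-corrected loss for a proxy label (assumed class-conditional error).

    alpha: assumed P(proxy = 1 | true outcome = 0), the false positive rate.
    beta:  assumed P(proxy = 0 | true outcome = 1), the false negative rate.
    Averaged over the proxy noise, this equals the loss on the true outcome.
    """
    denom = 1.0 - alpha - beta
    if denom <= 0:
        raise ValueError("requires alpha + beta < 1")
    if y_proxy == 1:
        return ((1 - alpha) * log_loss(p, 1) - beta * log_loss(p, 0)) / denom
    return ((1 - beta) * log_loss(p, 0) - alpha * log_loss(p, 1)) / denom


def weighted_corrected_risk(preds, proxies, treatments, propensities,
                            t, alpha_t, beta_t):
    """Inverse-propensity-weighted corrected risk for treatment arm t.

    propensities[i] is an assumed known P(treatment = 1 | covariates) for unit i.
    Only units observed under treatment t contribute; each is reweighted so the
    sum targets the full population rather than the selected subgroup.
    """
    total = 0.0
    for p, y, a, e in zip(preds, proxies, treatments, propensities):
        if a != t:
            continue
        weight = 1.0 / e if t == 1 else 1.0 / (1.0 - e)
        total += weight * corrected_loss(p, y, alpha_t, beta_t)
    return total / len(preds)


if __name__ == "__main__":
    # Hypothetical toy data: predictions, proxy labels, treatments, propensities.
    preds = [0.2, 0.7, 0.9, 0.4]
    proxies = [0, 1, 1, 0]
    treatments = [0, 1, 0, 1]
    propensities = [0.5, 0.5, 0.5, 0.5]
    risk = weighted_corrected_risk(preds, proxies, treatments, propensities,
                                   t=0, alpha_t=0.1, beta_t=0.2)
    print(f"corrected risk under no treatment: {risk:.4f}")
```

In this sketch the error rates are passed in per treatment arm because the abstract describes measurement error as treatment-dependent. When the rates are unknown they would have to be estimated first, which this example does not attempt.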
