Ground(less) Truth: A Causal Framework for Proxy Labels in Human-Algorithm Decision-Making

A growing literature on human-AI decision-making investigates strategies for combining human judgment with statistical models to improve decision-making. Research in this area often evaluates proposed improvements to models, interfaces, or workflows by demonstrating improved predictive performance on “ground truth” labels. However, this practice overlooks a key difference between human judgments and model predictions. Whereas humans commonly reason about broader phenomena of interest in a decision – including latent constructs that are not directly observable, such as disease status, the “toxicity” of online comments, or future “job performance” – predictive models target proxy labels that are readily available in existing datasets. Predictive models’ reliance on simplistic proxies for these nuanced phenomena makes them vulnerable to various sources of statistical bias. In this paper, we identify five sources of target variable bias that can impact the validity of proxy labels in human-AI decision-making tasks. We develop a causal framework to disentangle the relationships among these biases and to clarify which of them are of concern in specific human-AI decision-making tasks. We demonstrate how our framework can be used to articulate the implicit assumptions made in prior modeling work, and we recommend evaluation strategies for verifying whether these assumptions hold in practice. We then leverage our framework to re-examine the designs of prior human-subjects experiments that investigate human-AI decision-making, finding that only a small fraction of studies examine factors related to target variable bias. We conclude by discussing opportunities to better address target variable bias in future research.
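To make the core problem concrete, the following minimal simulation sketches one way target variable bias can distort evaluation; it is an illustration of the general issue, not the paper’s framework, and the two groups, base rate, and group-dependent error rates are hypothetical values chosen for exposition. Even a model that predicts the latent construct perfectly appears inaccurate when scored against a mismeasured proxy label, and the apparent accuracy gap between groups is an artifact of the label rather than the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent construct of interest (e.g., true disease status) -- never observed directly.
group = rng.integers(0, 2, n)            # two hypothetical subpopulations
y_construct = rng.binomial(1, 0.3, n)    # latent "ground truth"

# Proxy label recorded in the dataset, with group-dependent measurement error
# (illustrative flip rates, not estimates from any real dataset).
flip_prob = np.where(group == 0, 0.05, 0.20)
flip = rng.binomial(1, flip_prob)
y_proxy = np.where(flip == 1, 1 - y_construct, y_construct)

# A hypothetical model that predicts the latent construct perfectly.
y_pred = y_construct

for g in (0, 1):
    mask = group == g
    acc_proxy = (y_pred[mask] == y_proxy[mask]).mean()
    print(f"group {g}: accuracy vs. proxy label = {acc_proxy:.3f} "
          f"(vs. construct = 1.000)")
```

Running this prints proxy-label accuracies of roughly 0.95 and 0.80 for the two groups: the evaluation understates performance by exactly the groups’ label-error rates, despite identical (perfect) performance on the construct itself.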
