Towards a Science of Human-AI Decision Making: A Survey of Empirical Studies

As AI systems demonstrate increasingly strong predictive performance, their adoption has grown in numerous domains. However, in high-stakes domains such as criminal justice and healthcare, full automation is often not desirable due to safety, ethical, and legal concerns, yet fully manual approaches can be inaccurate and time-consuming. As a result, there is growing interest in the research community in augmenting human decision making with AI assistance. Besides developing AI technologies for this purpose, the emerging field of human-AI decision making must embrace empirical approaches to form a foundational understanding of how humans interact and work with AI to make decisions. To invite and help structure research efforts towards a science of understanding and improving human-AI decision making, we survey the recent literature of empirical human-subject studies on this topic. We summarize the study design choices made in over 100 papers along three important aspects: (1) decision tasks, (2) AI models and AI assistance elements, and (3) evaluation metrics. For each aspect, we summarize current trends, discuss gaps in current practices of the field, and provide a list of recommendations for future research. Our survey highlights the need to develop common frameworks to account for the design and research spaces of human-AI decision making, so that researchers can make rigorous choices in study design, and the research community can build on each other’s work and produce generalizable scientific knowledge. We also hope this survey will serve as a bridge for the HCI and AI communities to work together to mutually shape the empirical science and computational technologies for human-AI decision making.
