"The human body is a black box": supporting clinical decision-making with deep learning

Machine learning technologies are increasingly developed for use in healthcare. While research communities have focused on creating state-of-the-art models, far less attention has gone to real-world implementation and the challenges to fairness, transparency, and accountability that arise from actual, situated use. Serious questions remain underexamined: how to build models ethically, interpret and explain model output, recognize and account for biases, and minimize disruptions to professional expertise and work cultures. We address this gap in the literature with a detailed case study covering the development, implementation, and evaluation of Sepsis Watch, a machine learning-driven tool that assists hospital clinicians in the early diagnosis and treatment of sepsis. Sepsis is a life-threatening condition that arises from infection, can progress to organ failure or death if not treated in time, and is the leading cause of inpatient deaths in US hospitals. We, the team that developed and evaluated the tool, discuss our conceptualization of the tool not as a model deployed in the world but as a socio-technical system that must be integrated into existing social and professional contexts. Rather than relying solely on model interpretability to ensure fair and accountable machine learning, we point toward four key values and practices for developing machine learning to support clinical decision-making: rigorously define the problem in context, build relationships with stakeholders, respect professional discretion, and create ongoing feedback loops with stakeholders. Our work has significant implications for future research on mechanisms of institutional accountability and on the responsible design of machine learning systems. It underscores the limits of model interpretability as a route to transparency, accuracy, and accountability in practice, and it demonstrates other means and goals for achieving FATML values in design and in practice.
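To make concrete what a tool like Sepsis Watch computes at the bedside, the listing below sketches a minimal recurrent classifier that maps a window of routinely collected vital signs to a patient-level sepsis risk score. This is an illustrative sketch only: the class name, architecture, feature count, and hyperparameters are assumptions chosen for exposition, not the model actually deployed in the study.

import torch
import torch.nn as nn

class SepsisRiskRNN(nn.Module):
    # Illustrative early-warning model: maps (batch, hours, features)
    # windows of vitals/labs to a per-patient risk score in (0, 1).
    # NOT the deployed Sepsis Watch model; all sizes are placeholders.
    def __init__(self, n_features: int = 8, hidden_size: int = 64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, h_n = self.rnn(x)                       # final hidden state: (1, batch, hidden)
        return torch.sigmoid(self.head(h_n[-1]))  # one risk score per patient

model = SepsisRiskRNN()
window = torch.randn(4, 24, 8)   # 4 patients, 24 hourly steps, 8 features
risk = model(window)             # tensor of shape (4, 1), values in (0, 1)

In a deployed system, such a score would be recomputed as new measurements arrive and surfaced to clinicians through an existing workflow, such as a rapid response team, rather than acted on automatically.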
