Clinician-in-the-Loop Decision Making: Reinforcement Learning with Near-Optimal Set-Valued Policies

Standard reinforcement learning (RL) aims to find an optimal policy that identifies the single best action for each state. In healthcare settings, however, many actions may be near-equivalent with respect to the reward (e.g., survival). We consider an alternative objective: learning set-valued policies that capture near-equivalent actions leading to similar cumulative rewards. We propose a model-free algorithm based on temporal difference learning and a near-greedy heuristic for action selection. We analyze the theoretical properties of the proposed algorithm, providing optimality guarantees, and demonstrate our approach on simulated environments and a real clinical task. Empirically, the proposed algorithm exhibits good convergence properties and discovers meaningful near-equivalent actions. Our work provides theoretical as well as practical foundations for clinician/human-in-the-loop decision making, in which humans (e.g., clinicians, patients) can incorporate additional knowledge (e.g., side effects, patient preferences) when selecting among near-equivalent actions.
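To make the set-valued idea concrete, below is a minimal sketch of how a near-greedy, set-valued policy could sit on top of tabular temporal difference learning. The tolerance parameter zeta, the helper names (near_optimal_set, td_update, set_valued_policy), and the mean-over-set backup are illustrative assumptions rather than the paper's exact formulation; the sketch also assumes non-negative Q-values so that the multiplicative threshold is well defined.

```python
import numpy as np

def near_optimal_set(q_row, zeta):
    """Actions whose value is within a zeta-fraction of the best
    action's value in this state (assumes non-negative Q-values)."""
    q_max = np.max(q_row)
    return np.flatnonzero(q_row >= (1.0 - zeta) * q_max)

def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, zeta=0.05):
    """One TD backup that bootstraps from the near-optimal action set
    (here via its mean value) instead of the single greedy maximum."""
    candidates = near_optimal_set(Q[s_next], zeta)
    target = r + gamma * Q[s_next, candidates].mean()
    Q[s, a] += alpha * (target - Q[s, a])

def set_valued_policy(Q, s, zeta=0.05):
    """At decision time, surface all near-equivalent actions; a
    clinician can then choose among them using knowledge the reward
    does not encode (side effects, patient preference, etc.)."""
    return near_optimal_set(Q[s], zeta)
```

In this sketch, setting zeta = 0 recovers the standard greedy backup and a singleton policy, while larger zeta trades some value optimality for a larger menu of near-equivalent actions to present to the human in the loop.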
