Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions

Off-policy evaluation (OPE) in reinforcement learning offers the opportunity to use observational data to improve future outcomes in domains such as healthcare and education, but safe deployment in high-stakes settings requires ways of assessing its validity. Traditional measures such as confidence intervals may be insufficient due to noise, limited data, and confounding. In this paper we develop a method that could serve as a hybrid human-AI system, enabling human experts to analyze the validity of policy evaluation estimates. It works by highlighting observations in the data whose removal would have a large effect on the OPE estimate, and by formulating a set of rules for choosing which of them to present to domain experts for validation. We develop methods to compute these influence functions exactly for fitted Q-evaluation with two different function classes, kernel-based and linear least squares, as well as for importance sampling methods. Experiments on medical simulations and real-world intensive care unit data demonstrate that our method can be used to identify limitations in the evaluation process and to make evaluation more robust.
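
As a concrete illustration of the influence idea (a minimal sketch, not the paper's exact fitted Q-evaluation procedure), the snippet below computes leave-one-out influences for a linear least-squares value estimate: for each transition, the change in the OPE estimate when that transition is dropped and the model is refit. All names here (`loo_influences`, `Phi`, `y`, `Phi_init`) are hypothetical placeholders, and the regression targets are assumed to be fixed returns rather than the bootstrapped targets of full fitted Q-evaluation.

```python
import numpy as np

def loo_influences(Phi, y, Phi_init):
    """Leave-one-out influence of each sample on a linear least-squares
    value estimate.

    Phi      : (n, d) feature matrix for the observed transitions.
    y        : (n,)   regression targets (e.g. empirical returns).
    Phi_init : (m, d) features of initial states; the OPE estimate is
               the mean predicted value over these states.

    Returns an (n,) array: estimate without sample i minus full estimate.
    """
    n = len(y)
    # Full fit: theta = argmin ||Phi @ theta - y||^2.
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    full_value = (Phi_init @ theta).mean()

    influences = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i  # drop transition i and refit
        theta_i, *_ = np.linalg.lstsq(Phi[mask], y[mask], rcond=None)
        influences[i] = (Phi_init @ theta_i).mean() - full_value
    return influences

# Usage: flag the transitions whose removal moves the estimate the most,
# and surface them to a domain expert for validation.
# infl = loo_influences(Phi, y, Phi_init)
# to_review = np.argsort(-np.abs(infl))[:10]
```

The brute-force refit above costs one regression per sample; for linear least squares the same leave-one-out influences admit a closed form via the Sherman-Morrison identity on the Gram matrix, which is the standard route to computing them exactly without n separate refits.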
