POPCORN: Partially Observed Prediction COnstrained ReiNforcement Learning

Many medical decision-making tasks can be framed as partially observed Markov decision processes (POMDPs). However, prevailing two-stage approaches that first learn a POMDP and then solve it often fail because the model that best fits the data may not be well suited for planning. We introduce a new optimization objective that (a) produces both high-performing policies and high-quality generative models, even when some observations are irrelevant for planning, and (b) does so in batch off-policy settings typical in healthcare, where only retrospective data is available. We demonstrate our approach on synthetic examples and a challenging medical decision-making problem.
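To make the objective concrete: rather than maximizing generative likelihood alone and planning afterward, as in the two-stage pipeline, the idea is to optimize a single trade-off of the form log p(observations | actions; θ) + λ · V̂(π_θ), where V̂ is an off-policy estimate of the value of the policy induced by the model parameters θ. The following is a minimal sketch under toy assumptions (a one-step decision problem, a logistic stand-in for the belief-state policy, a Gaussian stand-in for the generative term, and an importance-sampling value estimate); all function names here are illustrative, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy retrospective batch: observation, the action logged by the behavior
# (clinician) policy, that policy's action probability, and the reward.
obs      = rng.normal(size=100)
actions  = rng.integers(0, 2, size=100)
beh_prob = np.full(100, 0.5)                  # behavior policy was uniform
rewards  = np.where(actions == (obs > 0), 1.0, 0.0)

def policy_prob(theta, obs, actions):
    """Probability the model-induced policy assigns to each logged action.
    A logistic policy on the raw observation stands in for the belief-state
    policy obtained by (approximately) solving the learned POMDP."""
    p1 = 1.0 / (1.0 + np.exp(-theta * obs))   # P(action = 1 | obs)
    return np.where(actions == 1, p1, 1.0 - p1)

def log_lik(theta, obs):
    """Stand-in generative term: Gaussian log-likelihood of observations."""
    return -0.5 * np.mean((obs - np.tanh(theta)) ** 2)

def is_value(theta):
    """Importance-sampling estimate of the induced policy's value, computed
    entirely from data gathered under the behavior policy (off-policy)."""
    w = policy_prob(theta, obs, actions) / beh_prob
    return np.mean(w * rewards)

def popcorn_objective(theta, lam=10.0):
    # Lagrangian relaxation of the prediction constraint: generative fit
    # plus lam times off-policy policy value, optimized jointly.
    return log_lik(theta, obs) + lam * is_value(theta)

# Crude grid search in place of gradient ascent, to keep the sketch short.
grid = np.linspace(-5, 5, 201)
best = grid[np.argmax([popcorn_objective(t) for t in grid])]
print(f"best theta: {best:.2f}, estimated value: {is_value(best):.3f}")
```

In a full instantiation, θ would parameterize the POMDP's transition, observation, and reward models; V̂ would come from a consistent off-policy estimator over multi-step trajectories; and the grid search would be replaced by gradient-based optimization.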
