Nested Policy Reinforcement Learning

Off-policy reinforcement learning (RL) has proven to be a powerful framework for guiding agents' actions in environments with stochastic rewards and unknown or noisy state dynamics. In many real-world settings, these agents must operate in multiple environments, each with slightly different dynamics. For example, we may be interested in developing policies to guide medical treatment for patients with and without a given disease, or policies to navigate curriculum design for students with and without a learning disability. Here, we introduce nested policy fitted Q-iteration (NFQI), an RL framework that finds optimal policies in environments that exhibit such structure. Our approach develops a nested Q-value function that takes advantage of the shared structure between two groups of observations from two separate environments while allowing their policies to remain distinct. We find that NFQI yields policies that rely on relevant features and perform at least as well as a policy that ignores group structure. We demonstrate NFQI's performance on an OpenAI Gym environment and a clinical decision-making RL task. Our results suggest that NFQI can produce policies that are better suited to many real-world clinical settings.
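
To make the nested structure concrete, the sketch below shows one way a shared-plus-group-specific Q-function and a batch fitted Q-iteration loop could be wired together in PyTorch. This is a minimal illustration under our own assumptions: the additive decomposition gated by a binary group indicator `z`, and all names (`NestedQNetwork`, `fitted_q_iteration`, the `(s, a, r, s', z)` batch layout) are hypothetical, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class NestedQNetwork(nn.Module):
    """Illustrative nested Q-function: a shared component captures structure
    common to both environments, and a group-specific component (gated by a
    binary group indicator z) lets the foreground group's Q-values deviate."""

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )
        self.group_specific = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, state, z):
        # z: (batch, 1) indicator, 0 = background group, 1 = foreground group.
        return self.shared(state) + z * self.group_specific(state)

def fitted_q_iteration(qnet, batch, gamma=0.99, n_iters=50, lr=1e-3):
    """Batch-mode FQI sketch over pooled transitions (s, a, r, s', z).
    Each iteration regresses Q(s, a) toward bootstrapped targets computed
    from the previous iterate; terminal-state masking is omitted for brevity."""
    s, a, r, s_next, z = batch  # a must be an int64 tensor of action indices
    opt = torch.optim.Adam(qnet.parameters(), lr=lr)
    for _ in range(n_iters):
        with torch.no_grad():
            target = r + gamma * qnet(s_next, z).max(dim=1).values
        pred = qnet(s, z).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return qnet
```

Because both groups' transitions pass through the shared component while only foreground transitions (z = 1) update the group-specific component, a network like this can borrow statistical strength across environments yet still produce distinct per-group policies via the usual greedy argmax over Q-values.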
