Off-Policy Policy Gradient with State Distribution Correction

We study the problem of off-policy policy optimization in Markov decision processes and develop a novel off-policy policy gradient method. Prior off-policy policy gradient approaches have generally ignored the mismatch between the distribution of states visited under the behavior policy used to collect data and the distribution of states that would be visited under the learned policy. Here we build on recent progress in estimating the ratio of the state distributions under the behavior and evaluation policies for policy evaluation, and present an off-policy policy gradient optimization technique that accounts for this mismatch in distributions. We give an illustrative example of why this correction matters and a theoretical convergence guarantee for our approach. Empirically, we compare our method in simulations against several strong baselines that do not correct for this mismatch, and find that it significantly improves the quality of the policy discovered.
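To make the idea concrete, below is a minimal sketch (not the paper's implementation) of a state-distribution-corrected off-policy gradient estimate in a tabular setting with a softmax policy. All names, shapes, and the placeholder values for the critic estimate `q_hat`, the behavior probabilities `mu_probs`, and the state-ratio vector `w` are illustrative assumptions; in practice `w(s) ≈ d^π(s)/d^μ(s)` would be estimated separately, e.g., with an infinite-horizon density-ratio estimator.

```python
import numpy as np

# Hypothetical tabular setup: nS states, nA actions, softmax policy parameters theta.
nS, nA = 5, 3
rng = np.random.default_rng(0)
theta = np.zeros((nS, nA))

def pi_probs(theta, s):
    """Softmax policy probabilities pi(.|s)."""
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi(a|s) w.r.t. theta (same shape as theta)."""
    g = np.zeros_like(theta)
    g[s] = -pi_probs(theta, s)
    g[s, a] += 1.0
    return g

# Logged batch from the behavior policy mu: states, actions, behavior-policy
# probabilities, and a per-sample value/advantage estimate (all placeholders).
states = rng.integers(0, nS, size=200)
actions = rng.integers(0, nA, size=200)
mu_probs = np.full(200, 1.0 / nA)    # assumed uniform behavior policy
q_hat = rng.normal(size=200)         # stand-in for a critic / return estimate

# State-distribution ratio w(s) ~ d^pi(s) / d^mu(s); using all-ones here,
# which recovers the uncorrected estimator that ignores the mismatch.
w = np.ones(nS)

def corrected_gradient(theta, states, actions, mu_probs, q_hat, w):
    """Off-policy gradient estimate weighted by both the per-action importance
    ratio and the state-distribution ratio."""
    g = np.zeros_like(theta)
    for s, a, mu_p, q in zip(states, actions, mu_probs, q_hat):
        rho = pi_probs(theta, s)[a] / mu_p   # action importance weight pi/mu
        g += w[s] * rho * q * grad_log_pi(theta, s, a)
    return g / len(states)

# One gradient-ascent step on the corrected estimate.
theta += 0.1 * corrected_gradient(theta, states, actions, mu_probs, q_hat, w)
```

Replacing the all-ones `w` with an estimated density ratio is exactly where the proposed correction enters; everything else matches a standard importance-weighted off-policy policy gradient.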
