Off-Policy Policy Gradient Algorithms by Constraining the State Distribution Shift

Off-policy deep reinforcement learning (RL) algorithms are incapable of learning solely from batch offline data, without online interactions with the environment, due to a phenomenon known as \textit{extrapolation error}. This error arises because the past data available in the replay buffer can differ substantially from the data distribution induced by the current policy. We argue that most off-policy learning methods fundamentally suffer from a \textit{state distribution shift}, caused by the mismatch between the state visitation distributions of the behavior and target policies. This distribution shift between current and past samples can significantly degrade the performance of most modern off-policy policy optimization algorithms. In this work, we first present a systematic analysis of the state distribution mismatch in off-policy learning, and then develop a novel off-policy policy optimization method that constrains the state distribution shift. To do so, we estimate the state distribution from state features using a density estimator, and then develop a novel constrained off-policy gradient objective that minimizes the state distribution shift. Our experimental results on continuous control tasks show that minimizing this distribution mismatch can significantly improve the performance of popular off-policy policy gradient algorithms.
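
The abstract describes two ingredients: a density estimator over state features and an off-policy gradient objective that is constrained (or corrected) by the estimated state distribution shift. The sketch below is one plausible way to combine them, not the paper's actual implementation: it assumes a DDPG/TD3-style deterministic actor with a given critic, uses kernel density estimators over raw state features as the density estimator, and reweights the actor objective by a clipped estimate of the state density ratio so that replay samples inconsistent with the current policy's state distribution are down-weighted. All names (Actor, fit_state_density, state_shift_weights, constrained_actor_loss) are illustrative.

```python
# Minimal sketch of a state-distribution-constrained off-policy actor update.
# NOT the authors' method; a hypothetical reconstruction from the abstract.
import numpy as np
import torch
import torch.nn as nn
from sklearn.neighbors import KernelDensity


class Actor(nn.Module):
    """Deterministic policy network mapping state features to actions."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)


def fit_state_density(states, bandwidth=0.5):
    """Fit a kernel density estimator to a batch of state features
    (states: np.ndarray of shape [N, state_dim])."""
    return KernelDensity(bandwidth=bandwidth).fit(states)


def state_shift_weights(buffer_states, kde_beta, kde_pi, clip=5.0):
    """Per-sample weights approximating d_pi(s) / d_beta(s), clipped for
    numerical stability. Replay states that are unlikely under the current
    policy's state visitation distribution receive small weights."""
    log_ratio = kde_pi.score_samples(buffer_states) - kde_beta.score_samples(buffer_states)
    return torch.as_tensor(np.exp(np.clip(log_ratio, -clip, clip)), dtype=torch.float32)


def constrained_actor_loss(actor, critic, buffer_states, weights):
    """Deterministic-policy-gradient-style actor loss, reweighted so the
    gradient is dominated by states consistent with the current policy's
    state distribution (a proxy for constraining the distribution shift)."""
    states = torch.as_tensor(buffer_states, dtype=torch.float32)
    q_values = critic(states, actor(states)).squeeze(-1)
    return -(weights * q_values).mean()
```

In this sketch, `kde_beta` would be fit on a batch of replay-buffer states and `kde_pi` on states from recent rollouts of the current policy; the weighted loss is then minimized with a standard optimizer step on the actor parameters. Other choices from the abstract's description are equally plausible, e.g., a learned (VAE-style) density model in place of the kernel density estimator, or an explicit divergence penalty instead of sample reweighting.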
