On- and Off-Policy Monotonic Policy Improvement

Monotonic policy improvement and off-policy learning are two desirable properties for reinforcement learning algorithms. In this paper, by lower bounding the performance difference between two policies, we show that monotonic policy improvement is guaranteed when learning from a mixture of on- and off-policy samples. An optimization procedure that applies the proposed bound can be regarded as an off-policy natural policy gradient method. To support the theoretical result, we provide a trust region policy optimization method with experience replay as a naive application of our bound, and evaluate its performance on two classical benchmark problems.
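For context, the classical on-policy guarantee from Kakade and Langford (2002) and Schulman et al. (2015), which the mixture-sample result described above parallels (the paper's own bound is not reproduced here), takes the form

$\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) \;-\; \frac{4\epsilon\gamma}{(1-\gamma)^{2}} \, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),$

where $\eta$ is the expected discounted return, $L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s,a)$ is the surrogate objective built from the advantage function $A_{\pi}$ under the discounted state visitation frequency $\rho_{\pi}$ of the current policy, $\gamma$ is the discount factor, and $\epsilon = \max_{s,a} \lvert A_{\pi}(s,a) \rvert$. Any $\tilde{\pi}$ that increases the right-hand side satisfies $\eta(\tilde{\pi}) \ge \eta(\pi)$, i.e., improvement is monotonic; the contribution summarized above is an analogous guarantee when the surrogate is estimated from a mixture of on- and off-policy samples.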

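The experimental setup described above (trust region policy optimization with experience replay as a naive application of the bound) can be pictured with the following minimal sketch. It is not the authors' implementation; all names (Policy, surrogate_loss, the batch keys) are hypothetical. It shows only the structural idea: an importance-weighted surrogate computed over a mixture of fresh and replayed samples, with a sample-based KL penalty standing in for the exact trust-region step.

import torch
import torch.nn as nn

class Policy(nn.Module):
    """Small categorical policy for discrete-action benchmarks."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(),
                                 nn.Linear(32, n_actions))

    def dist(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def surrogate_loss(policy, batch, kl_coef=1.0):
    """Importance-weighted surrogate over mixed on/off-policy samples.

    batch: dict of tensors with keys 'obs', 'act', 'adv', and
    'logp_behavior' (log-probability of each action under whichever
    policy generated it: the current policy for fresh samples or an
    older policy for replayed ones).
    """
    dist = policy.dist(batch["obs"])
    logp = dist.log_prob(batch["act"])
    ratio = torch.exp(logp - batch["logp_behavior"])   # importance weight
    improvement = (ratio * batch["adv"]).mean()        # surrogate objective
    # Sample-based estimate of KL(behavior || current); penalizing it keeps
    # the update close to the data-generating policies, playing the role of
    # the penalty term in the lower bound.
    approx_kl = (batch["logp_behavior"] - logp).mean()
    return -(improvement - kl_coef * approx_kl)

# Usage: collect fresh rollouts, mix them with trajectories drawn from a
# replay buffer, compute advantages, then take gradient steps on
# surrogate_loss with an ordinary optimizer (a simplification of TRPO's
# conjugate-gradient natural-gradient step).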