P3O: Policy-on Policy-off Policy Optimization

On-policy reinforcement learning (RL) algorithms have high sample complexity, while off-policy algorithms are difficult to tune. Merging the two holds the promise of developing efficient algorithms that generalize across diverse environments. In practice, however, it is challenging to find suitable hyper-parameters that govern this trade-off. This paper develops a simple algorithm named P3O that interleaves off-policy updates with on-policy updates. P3O uses the effective sample size between the behavior policy and the target policy to control how far they can diverge from each other, and it does not introduce any additional hyper-parameters. Extensive experiments on the Atari-2600 and MuJoCo benchmark suites show that this simple technique is effective in reducing the sample complexity of state-of-the-art algorithms. Code to reproduce the experiments in this paper is available at this https URL.
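As a rough illustration of the quantity the abstract refers to, the sketch below computes a normalized effective sample size (ESS) from log importance ratios between the target and behavior policies. The function name, the batch values, and the suggested use of the ESS as a weighting signal are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def normalized_ess(log_ratios: np.ndarray) -> float:
    """Normalized effective sample size in [0, 1].

    log_ratios[i] = log pi_target(a_i | s_i) - log pi_behavior(a_i | s_i)
    for samples drawn from the behavior policy. Uses the standard estimate
    ESS = (sum_i w_i)^2 / sum_i w_i^2, divided by n, so that 1.0 means the
    two policies agree on the sampled actions and values near 0 mean the
    off-policy data carries little information about the target policy.
    """
    log_w = log_ratios - np.max(log_ratios)   # shift for numerical stability
    w = np.exp(log_w)
    return float(np.square(w.sum()) / (len(w) * np.square(w).sum()))

# Hypothetical importance ratios for one replay-buffer batch.
log_ratios = np.array([-0.1, 0.2, -0.3, 0.05, 0.0])
ess = normalized_ess(log_ratios)
# A low ESS would suggest down-weighting off-policy updates or refreshing
# the behavior data; a high ESS permits more aggressive data reuse.
print(f"normalized ESS = {ess:.3f}")
```

Subtracting the maximum log ratio before exponentiating does not change the result, since the shift cancels between numerator and denominator, but it prevents overflow when the two policies differ substantially.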
