P3O: Policy-on Policy-off Policy Optimization

On-policy reinforcement learning (RL) algorithms have high sample complexity, while off-policy algorithms are difficult to tune. Merging the two holds the promise of developing efficient algorithms that generalize across diverse environments. In practice, however, it is challenging to find suitable hyper-parameters that govern this trade-off. This paper develops a simple algorithm named P3O that interleaves off-policy updates with on-policy updates. P3O uses the effective sample size between the behavior policy and the target policy to control how far they can diverge from each other, and it does not introduce any additional hyper-parameters. Extensive experiments on the Atari-2600 and MuJoCo benchmark suites show that this simple technique is effective in reducing the sample complexity of state-of-the-art algorithms. Code to reproduce the experiments in this paper is available at this https URL.
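As a rough illustration of the quantity the abstract refers to, the sketch below computes a normalized effective sample size (ESS) from log importance ratios between the target and behavior policies. The function name, the batch values, and the suggested use of the ESS as a weighting signal are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def normalized_ess(log_ratios: np.ndarray) -> float:
    """Normalized effective sample size in [0, 1].

    log_ratios[i] = log pi_target(a_i | s_i) - log pi_behavior(a_i | s_i)
    for samples drawn from the behavior policy. Uses the standard estimate
    ESS = (sum_i w_i)^2 / sum_i w_i^2, divided by n, so that 1.0 means the
    two policies agree on the sampled actions and values near 0 mean the
    off-policy data carries little information about the target policy.
    """
    log_w = log_ratios - np.max(log_ratios)   # shift for numerical stability
    w = np.exp(log_w)
    return float(np.square(w.sum()) / (len(w) * np.square(w).sum()))

# Hypothetical importance ratios for one replay-buffer batch.
log_ratios = np.array([-0.1, 0.2, -0.3, 0.05, 0.0])
ess = normalized_ess(log_ratios)
# A low ESS would suggest down-weighting off-policy updates or refreshing
# the behavior data; a high ESS permits more aggressive data reuse.
print(f"normalized ESS = {ess:.3f}")
```

Subtracting the maximum log ratio before exponentiating does not change the result, since the shift cancels between numerator and denominator, but it prevents overflow when the two policies differ substantially.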
