Implementation Matters in Deep RL: A Case Study on PPO and TRPO
Logan Engstrom | Andrew Ilyas | Shibani Santurkar | Dimitris Tsipras | Firdaus Janoos | Larry Rudolph | Aleksander Madry