Proximal Policy Optimization via Enhanced Exploration Efficiency

The proximal policy optimization (PPO) algorithm is a deep reinforcement learning algorithm with outstanding performance, especially on continuous control tasks. However, its performance is still limited by its exploration ability. In classical reinforcement learning, several schemes balance exploration against exploitation more thoroughly, but their algorithmic complexity prevents them from being applied in complex environments. Focusing on continuous control tasks with dense rewards, this paper analyzes the assumptions behind the original Gaussian action-exploration mechanism in PPO and clarifies how exploration ability affects performance. We then design an exploration-enhancement mechanism based on uncertainty estimation. Applying this mechanism to PPO, we propose proximal policy optimization with an intrinsic exploration module (IEM-PPO), which can be used in complex environments. In the experiments, we evaluate the method on multiple tasks in the MuJoCo physics simulator and compare IEM-PPO with a curiosity-driven exploration baseline (ICM-PPO) and the original PPO. The results show that IEM-PPO requires longer training time but achieves better sample efficiency and higher cumulative reward, and exhibits stability and robustness.
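The abstract only sketches the exploration-enhancement mechanism; as an illustration under stated assumptions, the minimal Python sketch below shows one common way an uncertainty-based intrinsic bonus could be combined with the extrinsic reward before a standard PPO update, using the disagreement of an ensemble of learned forward-dynamics models as the uncertainty estimate. The class and parameter names (DynamicsModel, IntrinsicExplorationModule, beta) are hypothetical and are not taken from the paper.

```python
# Illustrative sketch (not the paper's exact method): an uncertainty-based
# intrinsic reward from the disagreement of an ensemble of forward-dynamics
# models, added to the environment reward before a standard PPO update.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Small MLP that predicts the next state from (state, action)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class IntrinsicExplorationModule:
    """Ensemble disagreement as an uncertainty estimate (hypothetical design)."""
    def __init__(self, state_dim, action_dim, n_models=5, beta=0.1, lr=1e-3):
        self.models = [DynamicsModel(state_dim, action_dim) for _ in range(n_models)]
        params = [p for m in self.models for p in m.parameters()]
        self.optim = torch.optim.Adam(params, lr=lr)
        self.beta = beta  # weight of the intrinsic bonus

    def intrinsic_reward(self, state, action):
        # Higher prediction variance across the ensemble -> higher uncertainty.
        with torch.no_grad():
            preds = torch.stack([m(state, action) for m in self.models])  # (n, B, S)
        return preds.var(dim=0).mean(dim=-1)  # (B,)

    def update(self, state, action, next_state):
        # Train each ensemble member to predict the observed next state.
        loss = sum(((m(state, action) - next_state) ** 2).mean() for m in self.models)
        self.optim.zero_grad()
        loss.backward()
        self.optim.step()
        return loss.item()

# Usage: augment the extrinsic reward before computing PPO advantages, e.g.
#   total_reward = extrinsic_reward + iem.beta * iem.intrinsic_reward(states, actions)
```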
