Proximal Parameter Distribution Optimization

Encouraging the agent to explore has become an important topic in reinforcement learning (RL). Popular approaches to exploration mainly inject noise into neural network (NN) parameters or augment the reward with an additional intrinsic motivation term. However, the randomness of the injected noise and the manually chosen intrinsic reward metric may make RL agents deviate from the optimal policy during learning. To enhance the exploration ability of the agent while ensuring the stability of parameter learning, we propose a novel proximal parameter distribution optimization (PPDO) algorithm. On the one hand, PPDO enhances the exploration ability of the RL agent by replacing each point-valued NN parameter with a parameter distribution. On the other hand, PPDO accelerates the optimization of the parameter distribution by maintaining two groups of parameters and guiding the update with the change in parameter quality measured before and after each distribution update. In addition, PPDO reduces the influence of bias and variance on the value function approximation by limiting the amplitude of two consecutive parameter updates, which enhances the stability of the parameter distribution optimization. Experiments on the OpenAI Gym, Atari, and MuJoCo platforms indicate that PPDO improves the exploration ability and learning efficiency of deep RL algorithms, including DQN and A3C.
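
The abstract only outlines PPDO, so the following is a minimal illustrative sketch (in PyTorch) of the two ingredients it names: treating NN parameters as a distribution rather than a point value, and limiting how far the distribution parameters may move between two consecutive updates. Names such as GaussianLinear and proximal_clip_, the clipping rule, and the stand-in loss are assumptions for illustration, not the paper's actual method.

```python
# Illustrative sketch only -- not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianLinear(nn.Module):
    """Linear layer whose weights are a factorized Gaussian (mean, log-std)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.w_log_sigma = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.b = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Reparameterization: sample weights on every forward pass, which is
        # the source of exploration noise once parameters are a distribution.
        eps = torch.randn_like(self.w_mu)
        w = self.w_mu + torch.exp(self.w_log_sigma) * eps
        return F.linear(x, w, self.b)

@torch.no_grad()
def proximal_clip_(params, old_params, max_delta=0.02):
    """Limit the amplitude of one update: keep each distribution parameter
    within max_delta of its value before the update (assumed clipping rule)."""
    for p, p_old in zip(params, old_params):
        p.clamp_(p_old - max_delta, p_old + max_delta)

# Two groups of parameters, as the abstract describes: a snapshot taken before
# the update (used to judge the parameter-quality change) and the live copy.
layer = GaussianLinear(4, 2)
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-3)
old_snapshot = [p.detach().clone() for p in layer.parameters()]

x = torch.randn(8, 4)
target = torch.randn(8, 2)
loss = F.mse_loss(layer(x), target)   # stand-in for the RL objective
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Proximal step: bound the change between two consecutive parameter updates.
proximal_clip_(list(layer.parameters()), old_snapshot)
```

In an actual RL setting the stand-in loss would be the DQN or A3C objective, and the decision to keep or revert an update would be based on the measured change in parameter quality, which the abstract mentions but does not specify.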
