Distributed Fusion-Based Policy Search for Fast Robot Locomotion Learning

Deep reinforcement learning methods have been developed to address challenging locomotion control problems in the robotics domain and can achieve significant performance improvements over conventional control methods. One of their appealing advantages is that they are model-free: agents learn a control policy entirely from scratch using raw, high-dimensional sensory observations. However, these methods often suffer from poor sample efficiency and instability, which makes them inapplicable to many engineering systems. This paper presents a distributed fusion-based policy search framework that accelerates robot locomotion learning through variance reduction and asynchronous exploration. An adaptive fusion-based variance reduction technique is introduced to improve sample efficiency. Parametric noise is added to the neural network weights, which leads to efficient exploration and ensures consistency in actions. The fusion-based policy gradient estimator is then extended to a distributed, decoupled actor-critic architecture. This allows the central estimator to consume off-policy data from different actors asynchronously, fully utilizing CPUs and GPUs to maximize data throughput. The aim of this work is to improve the sample efficiency and convergence speed of deep reinforcement learning in robot locomotion tasks. Simulation results are presented to verify the theoretical results and show that the proposed algorithm matches, and sometimes surpasses, state-of-the-art performance.
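The abstract's exploration scheme perturbs the policy's weights rather than its actions, so that exploration is temporally consistent within an episode. The following is a minimal, hypothetical sketch of that idea for a tiny NumPy policy; the network architecture, the `target_dist` threshold, and the multiplicative adaptation factor are all illustrative assumptions, not the paper's actual implementation. The noise scale grows when perturbed actions stay too close to the unperturbed policy's actions and shrinks when they drift too far.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_policy(obs_dim, act_dim, hidden=32):
    """Tiny two-layer tanh policy; weights stored as a dict (illustrative)."""
    return {
        "W1": rng.standard_normal((obs_dim, hidden)) * 0.1,
        "W2": rng.standard_normal((hidden, act_dim)) * 0.1,
    }

def act(params, obs):
    """Deterministic forward pass producing a bounded action."""
    h = np.tanh(obs @ params["W1"])
    return np.tanh(h @ params["W2"])

def perturb(params, sigma):
    """Add i.i.d. Gaussian noise of scale sigma to every weight matrix."""
    return {k: v + sigma * rng.standard_normal(v.shape) for k, v in params.items()}

def adapt_sigma(sigma, dist, target_dist, factor=1.01):
    """Adaptive noise scaling: widen exploration if perturbed actions are
    too similar to the clean policy's, narrow it if they diverge too much."""
    return sigma * factor if dist < target_dist else sigma / factor

params = init_policy(obs_dim=4, act_dim=2)
sigma, target_dist = 0.1, 0.2
for episode in range(5):
    noisy = perturb(params, sigma)   # one perturbation per episode, so actions
    obs = rng.standard_normal(4)     # stay consistent within the episode
    dist = np.linalg.norm(act(noisy, obs) - act(params, obs))
    sigma = adapt_sigma(sigma, dist, target_dist)
```

In a distributed actor-critic setting, each actor would hold its own perturbed copy of the weights while the central estimator aggregates their off-policy experience; only the per-actor perturbation is shown here.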
