Feature Extraction for Effective and Efficient Deep Reinforcement Learning on Real Robotic Platforms

Deep reinforcement learning (DRL) methods can solve complex continuous control tasks in simulated environments by taking actions based solely on state observations at each decision point. Because of the dynamics involved, individual snapshots of real-world sensor measurements afford only partial state observability, so it is typical to use a history of observations to improve training and policy performance. Such intertemporal information can be further exploited by using a recurrent neural network (RNN) to reduce the dimensionality of the dynamic state representation. However, using RNNs as an internal part of a DRL network presents challenges of its own, and even then the improvements in the resulting policies are usually limited. To address these shortcomings, we propose using gated feature extraction to improve DRL training of real-world robots. Specifically, we use an untrained gated recurrent unit (GRU) to encode a low-dimensional representation of the state observation sequence before passing it to the DRL training procedure. In addition to dimensionality reduction, this allows us to unroll the RNN by encoding the observations cumulatively as they are collected, thereby avoiding same-length input requirements, and to train the RL network on the raw observation at the current step combined with the GRU encoding of the preceding steps. Our simulation experiments employ gated feature extraction with the TD3 algorithm. The results show that the GRU-encoded state observations improve both the training speed and the execution performance of TD3: the learned policies improve in all 19 test cases, the maximum achieved reward is exceeded by over 38% in eight of them and doubled in three, and our approach also outperforms a baseline implementation of SAC in 17 of the 19 environments. Moreover, the greatest improvement is seen in real-world experiments, where our approach successfully learns both to balance a pendulum and to perform a complex quadrupedal locomotion task. In contrast, the standard TD3 algorithm not only fails to show any learning progress but also repeatedly damages the hardware.
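To make the encoding step concrete, the sketch below shows one way such an untrained GRU feature extractor could be wired in front of an off-the-shelf DRL agent. It is a minimal illustration only, assuming PyTorch's nn.GRU and an arbitrary hidden size of 32; the class and method names (GRUStateEncoder, reset, encode) are ours and do not reflect the authors' implementation.

```python
import torch
import torch.nn as nn


class GRUStateEncoder:
    """Untrained (frozen) GRU that cumulatively summarizes the observation
    history into a fixed-size vector; sizes and names are illustrative."""

    def __init__(self, obs_dim: int, hidden_dim: int = 32, seed: int = 0):
        torch.manual_seed(seed)
        # Randomly initialized and never trained: a fixed feature extractor.
        self.gru = nn.GRU(input_size=obs_dim, hidden_size=hidden_dim,
                          batch_first=True)
        for p in self.gru.parameters():
            p.requires_grad_(False)
        self.hidden = None  # running GRU state, updated as observations arrive

    def reset(self):
        """Clear the running encoding at the start of each episode."""
        self.hidden = None

    def encode(self, obs: torch.Tensor) -> torch.Tensor:
        """Consume one raw observation (1-D tensor) and return the feature
        vector [current raw obs, GRU summary of the preceding steps]."""
        with torch.no_grad():
            # Summary of the *preceding* steps, taken before the new obs is fed in.
            prev = (self.hidden.squeeze() if self.hidden is not None
                    else torch.zeros(self.gru.hidden_size))
            # Advance the running encoding with the current observation.
            _, self.hidden = self.gru(obs.view(1, 1, -1), self.hidden)
        # The DRL agent (e.g. TD3) is trained on this concatenated input.
        return torch.cat([obs, prev])
```

Because the GRU weights stay frozen, no backpropagation through time is required and the observation history can be of any length; at each step only the raw current observation and the fixed-size summary of the past are handed to the TD3 networks.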
