Paper Information - Learning to Walk via Deep Reinforcement Learning

Learning to Walk via Deep Reinforcement Learning

Deep reinforcement learning (deep RL) holds the promise of automating the acquisition of complex controllers that can map sensory inputs directly to low-level actions. In the domain of robotic locomotion, deep RL could enable learning locomotion skills with minimal engineering and without an explicit model of the robot dynamics. Unfortunately, applying deep RL to real-world robotic tasks is exceptionally difficult, primarily due to poor sample complexity and sensitivity to hyperparameters. While hyperparameters can be easily tuned in simulated domains, tuning may be prohibitively expensive on physical systems, such as legged robots, that can be damaged through extensive trial-and-error learning. In this paper, we propose a sample-efficient deep RL algorithm based on maximum entropy RL that requires minimal per-task tuning and only a modest number of trials to learn neural network policies. We apply this method to learning walking gaits on a real-world Minitaur robot. Our method can acquire a stable gait from scratch directly in the real world in about two hours, without relying on any model or simulation, and the resulting policy is robust to moderate variations in the environment. We further show that our algorithm achieves state-of-the-art performance on simulated benchmarks with a single set of hyperparameters. Videos of training and the learned policy can be found on the project website.
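
For readers unfamiliar with the term, the maximum entropy RL framework mentioned in the abstract is usually written as the objective below. This is the standard formulation from the maximum entropy RL literature, given here only as a sketch; it is not quoted from this page, and the paper's own algorithm may differ in detail, for example in how the temperature is chosen. The symbols are assumptions of this sketch: $\pi$ is the policy, $\rho_\pi$ the state-action distribution it induces, $r$ the reward, $\mathcal{H}$ the entropy, and $\alpha$ a temperature that trades off reward against entropy.

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s_t)\big)
  = - \mathbb{E}_{a \sim \pi(\cdot \mid s_t)} \left[ \log \pi(a \mid s_t) \right]
```

Setting $\alpha = 0$ recovers the conventional expected-return objective, while larger values of $\alpha$ reward the policy for keeping its action distribution broad, which encourages exploration.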
