Stochastic policy gradient reinforcement learning on a simple 3D biped

We present a learning system that quickly and reliably acquires a robust feedback control policy for 3D dynamic walking from a blank slate, using only trials implemented on our physical robot. The robot begins walking within a minute and learning converges in approximately 20 minutes. This success can be attributed to the mechanics of our robot, which are modeled after a passive dynamic walker, and to a dramatic reduction in the dimensionality of the learning problem. We reduce the dimensionality by designing a robot with only 6 internal degrees of freedom and 4 actuators, by decomposing the control system into the frontal and sagittal planes, and by formulating the learning problem on the discrete return-map dynamics. We apply a stochastic policy gradient algorithm to this reduced problem and decrease the variance of the update using a state-based estimate of the expected cost. This optimized learning system works quickly enough that the robot is able to continually adapt to the terrain as it walks.
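The update described above (a stochastic policy gradient applied on the discrete return-map dynamics, with a state-based estimate of the expected cost subtracted as a baseline to reduce variance) can be illustrated by the minimal sketch below. This is not the controller run on the robot: the feature map, Gaussian exploration on a scalar action, and the step sizes are assumptions introduced only for the example.

```python
import numpy as np

def features(x):
    # Hypothetical feature vector for the return-map state x (bias term + raw state).
    return np.concatenate(([1.0], np.asarray(x, dtype=float)))

class ReturnMapPolicyGradient:
    """One update per return-map crossing: policy gradient with a learned baseline."""

    def __init__(self, n_features, alpha_w=0.01, alpha_v=0.1, sigma=0.05):
        self.w = np.zeros(n_features)   # policy (feedback-gain) parameters
        self.v = np.zeros(n_features)   # baseline / value-function parameters
        self.alpha_w = alpha_w          # policy step size (assumed value)
        self.alpha_v = alpha_v          # baseline step size (assumed value)
        self.sigma = sigma              # exploration noise scale (assumed value)

    def act(self, x):
        # Sample a scalar control parameter around the deterministic policy output.
        phi = features(x)
        mean = float(self.w @ phi)
        noise = self.sigma * np.random.randn()
        return mean + noise, noise, phi

    def update(self, phi, noise, cost, phi_next, done):
        # Baseline-subtracted cost signal (a one-step temporal-difference error).
        v = float(self.v @ phi)
        v_next = 0.0 if done else float(self.v @ phi_next)
        delta = cost + v_next - v
        # Gradient of the log-probability of a Gaussian policy: (noise / sigma^2) * phi.
        # Descend on expected cost, using delta in place of the raw cost to cut variance.
        self.w -= self.alpha_w * delta * (noise / self.sigma ** 2) * phi
        # Move the baseline toward the observed cost-to-go.
        self.v += self.alpha_v * delta * phi
```

In a learning loop, act would be called once per crossing of the return map, and update would then receive the per-step cost together with the feature vectors of the current and next crossing, so that every step of walking yields one small adjustment to the feedback policy.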
