Learning to Control a 6-Degree-of-Freedom Walking Robot

We analyze the problem of optimizing a control policy for a complex system through simulated trial-and-error learning. The approach we consider is reinforcement learning (RL). Stationary policies, as applied by most RL methods, can be inadequate in control applications: when the time discretization is fine enough, they lose their exploration capabilities and yield policy gradient estimators of very large variance. As a remedy to these difficulties, we earlier proposed the use of piecewise non-Markov policies. In the experimental study presented here we apply this approach to a 6-degree-of-freedom walking robot and obtain an efficient control policy for this robot. A minimal sketch of the mechanism the abstract alludes to is given below.
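The sketch below is not the paper's implementation; it only illustrates the mechanism the abstract describes. It contrasts a stationary Gaussian policy that redraws exploration noise at every fine time step with a piecewise policy that holds the same exploration perturbation over a window of steps, so that exploration does not wash out as the step size shrinks. The toy double-integrator dynamics, the `linear_policy` and `rollout` functions, and all parameter values are illustrative assumptions, not the authors' setup.

```python
# Illustrative sketch (assumed names and dynamics, not the paper's code):
# per-step Gaussian exploration vs. exploration held piecewise constant
# over a window of fine time steps, with a REINFORCE-style gradient estimate.
import numpy as np

rng = np.random.default_rng(0)

def linear_policy(theta, state):
    """Deterministic part of the policy: mean control for a given state."""
    return theta @ state

def rollout(theta, dt=0.001, horizon=1.0, hold_steps=1, sigma=0.1):
    """Simulate a toy 1-D double integrator for one episode and return a
    crude episodic policy-gradient estimate.  With hold_steps == 1 the
    exploration noise is redrawn at every fine step (stationary policy);
    with hold_steps > 1 it is held piecewise constant (non-Markov policy),
    so its effect on the trajectory does not average out as dt shrinks."""
    n_steps = int(horizon / dt)
    state = np.array([1.0, 0.0])          # position, velocity
    grad = np.zeros_like(theta)
    ret = 0.0
    noise = 0.0
    for t in range(n_steps):
        if t % hold_steps == 0:            # redraw exploration only at window starts
            noise = sigma * rng.standard_normal()
        u = linear_policy(theta, state) + noise
        # score function of a Gaussian policy: (u - mean) / sigma^2 * d(mean)/d(theta)
        grad += (noise / sigma**2) * state
        # toy dynamics: the control acts as acceleration
        state = state + dt * np.array([state[1], u])
        ret += -dt * (state[0]**2 + 0.01 * u**2)   # quadratic cost as negative reward
    return ret * grad

theta = np.array([-1.0, -1.0])
per_step  = np.array([rollout(theta, hold_steps=1)  for _ in range(50)])
piecewise = np.array([rollout(theta, hold_steps=50) for _ in range(50)])
print("gradient estimate spread, per-step noise :", per_step.std(axis=0))
print("gradient estimate spread, piecewise noise:", piecewise.std(axis=0))
```

The `hold_steps` parameter stands in for the "piecewise" aspect of the proposed policies: exploration decisions are made at a coarser time scale than the control loop, which is the property the abstract credits with preserving exploration under fine discretization.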
