Parallel Online Temporal Difference Learning for Motor Control

Temporal difference (TD) learning, a key concept in reinforcement learning, is a popular method for solving simulated control problems. In real systems, however, it is often avoided in favor of policy search methods because of its long learning time, even though policy search has drawbacks of its own, such as the need for an informed policy parameterization and initialization. In this paper, we show that TD learning can also work effectively on real robotic systems when combined with parallel model learning and planning. Using locally weighted linear regression and trajectory-sampled planning with 14 concurrent threads, we achieve a speedup of almost two orders of magnitude over regular TD control on simulated control benchmarks. For a real-world pendulum swing-up task and a two-link manipulator movement task, we report speedups of 20× to 60×, with wall-clock learning times of less than half a minute. The results are competitive with state-of-the-art policy search.
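To make the architecture concrete, the sketch below shows the general Dyna-style structure the abstract describes: one learner thread performs online TD updates on real transitions and records them in a learned model, while several planner threads replay transitions sampled from that model in the background. This is a minimal, hypothetical illustration only: it uses a tabular Q-function, a toy chain MDP, and Python threads, whereas the paper uses locally weighted linear regression, trajectory-sampled planning, and real robot tasks.

```python
# Minimal sketch of parallel online TD learning with model-based planning
# (Dyna-style). All names and the toy MDP are illustrative assumptions,
# not the paper's implementation. In CPython, threads share the GIL, so a
# real speedup would require processes or a compiled language.
import random
import threading
import time

N_STATES, N_ACTIONS = 10, 2            # toy chain MDP: move left / right
GAMMA, ALPHA, EPSILON = 0.95, 0.1, 0.1

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
model = {}                              # learned model: (s, a) -> (reward, next state)
lock = threading.Lock()
stop = threading.Event()


def step(s, a):
    """Toy deterministic dynamics: reward 1 only at the rightmost state."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return (1.0 if s2 == N_STATES - 1 else 0.0), s2


def epsilon_greedy(s):
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[s][a])


def td_update(s, a, r, s2):
    """One TD (Q-learning) backup, shared by the learner and the planners."""
    with lock:
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])


def learner(episodes=200, horizon=50):
    """Interacts with the 'real' system, updates Q, and records the model."""
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            a = epsilon_greedy(s)
            r, s2 = step(s, a)
            td_update(s, a, r, s2)
            with lock:
                model[(s, a)] = (r, s2)
            s = s2
    stop.set()


def planner():
    """Replays transitions sampled from the learned model in the background."""
    while not stop.is_set():
        with lock:
            if not model:
                sample = None
            else:
                s, a = random.choice(list(model))
                sample = (s, a, *model[(s, a)])
        if sample is None:
            time.sleep(0.001)           # model is still empty; wait briefly
            continue
        td_update(*sample)


if __name__ == "__main__":
    planners = [threading.Thread(target=planner) for _ in range(4)]
    for t in planners:
        t.start()
    learner()                           # run the online learner in the main thread
    for t in planners:
        t.join()
    greedy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)]
    print("greedy action per state:", greedy)
```

Running the script prints a greedy policy that moves right toward the rewarding end of the chain; the design point is that the planner threads add extra TD backups from the learned model without slowing the online learner, which is the mechanism behind the reported wall-clock speedups.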