Scaling life-long off-policy learning

In this paper we pursue an approach to scaling life-long learning using parallel off-policy reinforcement learning algorithms. In life-long learning, a robot continually learns from a lifetime of experience, slowly acquiring and applying skills and knowledge to new situations. Many of the benefits of life-long learning are a result of scaling the amount of training data processed by the robot to long sensorimotor streams. Another dimension of scaling comes from off-policy sampling: learning about many target policies from the single unending stream of sensorimotor data generated by a long-lived robot. Recent algorithmic developments have made it possible, for the first time, to apply off-policy algorithms to life-long learning in a sound way. We assess the scalability of these off-policy algorithms on a physical robot. We show that hundreds of accurate multi-step predictions can be learned about several policies in parallel and in real time. We present the first online measures of off-policy learning progress. Finally, we demonstrate that our robot, using the new off-policy measures, can learn 8000 predictions about 300 distinct policies, a substantial increase in scale compared to previous simulated and robotic life-long learning systems.
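To make the parallel off-policy setting concrete, the following is a minimal NumPy sketch of GTD(lambda), the gradient temporal-difference algorithm used in this line of work, updating many predictions at once from a single behavior stream. The feature dimension, step sizes, cumulants, importance-sampling ratios, and random data below are illustrative assumptions, not the paper's actual robot configuration.

```python
# Sketch: many off-policy predictions learned in parallel with GTD(lambda).
# Update rules follow the standard GTD(lambda) equations; everything else
# (sizes, step sizes, the fake data stream) is an illustrative assumption.
import numpy as np

n_features = 1000      # size of the shared (sparse, binary) feature vector
n_predictions = 8000   # one prediction (weight vector) per learner

alpha = 0.1 / 50       # primary step size (assumed: base rate / active features)
beta = 0.001           # secondary step size for the correction weights
lam = 0.9              # eligibility-trace decay
gamma = 0.98           # discount, i.e., the prediction timescale

W = np.zeros((n_predictions, n_features))   # prediction weights
H = np.zeros((n_predictions, n_features))   # gradient-correction weights
E = np.zeros((n_predictions, n_features))   # eligibility traces

def gtd_lambda_step(x, x_next, r, rho):
    """One parallel GTD(lambda) update for all predictions.

    x, x_next : shared feature vectors at times t and t+1
    r         : per-prediction cumulant ("reward") vector, shape (n_predictions,)
    rho       : per-prediction importance-sampling ratio pi(a|s) / b(a|s)
    """
    global W, H, E
    delta = r + gamma * (W @ x_next) - (W @ x)           # per-prediction TD errors
    E = rho[:, None] * (gamma * lam * E + x[None, :])    # trace update with IS ratio
    corr = gamma * (1.0 - lam) * (E * H).sum(axis=1)     # gradient-correction term
    W += alpha * (delta[:, None] * E - corr[:, None] * x_next[None, :])
    H += beta * (delta[:, None] * E - (H @ x)[:, None] * x[None, :])

# Illustrative driver loop on random data, standing in for the robot's
# sensorimotor stream.
rng = np.random.default_rng(0)
x = (rng.random(n_features) < 0.05).astype(float)
for t in range(100):
    x_next = (rng.random(n_features) < 0.05).astype(float)
    r = rng.standard_normal(n_predictions)               # assumed cumulants
    rho = rng.uniform(0.0, 2.0, n_predictions)           # assumed pi/b ratios
    gtd_lambda_step(x, x_next, r, rho)
    x = x_next
```

Every prediction shares the same feature vector but carries its own cumulant, importance-sampling ratio, traces, and weights, so the per-time-step cost is linear in the number of predictions, which is what allows thousands of them to be updated in real time on one sensorimotor stream.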
