Scaling life-long off-policy learning

In this paper we pursue an approach to scaling life-long learning using parallel off-policy reinforcement learning algorithms. In life-long learning, a robot continually learns from a lifetime of experience, slowly acquiring and applying skills and knowledge to new situations. Many of the benefits of life-long learning are a result of scaling the amount of training data processed by the robot to long sensorimotor streams. Another dimension of scaling comes from off-policy sampling: learning about many target policies from the single unending stream of sensorimotor data generated by a long-lived robot. Recent algorithmic developments have made it possible, for the first time, to apply off-policy algorithms to life-long learning in a sound way. We assess the scalability of these off-policy algorithms on a physical robot. We show that hundreds of accurate multi-step predictions can be learned about several policies in parallel and in real time. We present the first online measures of off-policy learning progress. Finally, we demonstrate that our robot, using the new off-policy measures, can learn 8000 predictions about 300 distinct policies, a substantial increase in scale compared to previous simulated and robotic life-long learning systems.
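To make the parallel off-policy setting concrete, the following is a minimal NumPy sketch of GTD(lambda), the gradient temporal-difference algorithm used in this line of work, updating many predictions at once from a single behavior stream. The feature dimension, step sizes, cumulants, importance-sampling ratios, and random data below are illustrative assumptions, not the paper's actual robot configuration.

```python
# Sketch: many off-policy predictions learned in parallel with GTD(lambda).
# Update rules follow the standard GTD(lambda) equations; everything else
# (sizes, step sizes, the fake data stream) is an illustrative assumption.
import numpy as np

n_features = 1000      # size of the shared (sparse, binary) feature vector
n_predictions = 8000   # one prediction (weight vector) per learner

alpha = 0.1 / 50       # primary step size (assumed: base rate / active features)
beta = 0.001           # secondary step size for the correction weights
lam = 0.9              # eligibility-trace decay
gamma = 0.98           # discount, i.e., the prediction timescale

W = np.zeros((n_predictions, n_features))   # prediction weights
H = np.zeros((n_predictions, n_features))   # gradient-correction weights
E = np.zeros((n_predictions, n_features))   # eligibility traces

def gtd_lambda_step(x, x_next, r, rho):
    """One parallel GTD(lambda) update for all predictions.

    x, x_next : shared feature vectors at times t and t+1
    r         : per-prediction cumulant ("reward") vector, shape (n_predictions,)
    rho       : per-prediction importance-sampling ratio pi(a|s) / b(a|s)
    """
    global W, H, E
    delta = r + gamma * (W @ x_next) - (W @ x)           # per-prediction TD errors
    E = rho[:, None] * (gamma * lam * E + x[None, :])    # trace update with IS ratio
    corr = gamma * (1.0 - lam) * (E * H).sum(axis=1)     # gradient-correction term
    W += alpha * (delta[:, None] * E - corr[:, None] * x_next[None, :])
    H += beta * (delta[:, None] * E - (H @ x)[:, None] * x[None, :])

# Illustrative driver loop on random data, standing in for the robot's
# sensorimotor stream.
rng = np.random.default_rng(0)
x = (rng.random(n_features) < 0.05).astype(float)
for t in range(100):
    x_next = (rng.random(n_features) < 0.05).astype(float)
    r = rng.standard_normal(n_predictions)               # assumed cumulants
    rho = rng.uniform(0.0, 2.0, n_predictions)           # assumed pi/b ratios
    gtd_lambda_step(x, x_next, r, rho)
    x = x_next
```

Every prediction shares the same feature vector but carries its own cumulant, importance-sampling ratio, traces, and weights, so the per-time-step cost is linear in the number of predictions, which is what allows thousands of them to be updated in real time on one sensorimotor stream.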
