Stable predictive representations with general value functions for continual learning

The objective of continual learning is to build agents that continually learn about their world, building on prior learning. In this paper, we explore an approach to continual learning based on making and updating many predictions formalized as general value functions (GVFs). The idea behind GVFs is simple: if we can cast the task of representing predictive knowledge as a prediction of future reward, then computationally efficient policy evaluation methods from reinforcement learning can be used to learn a large collection of predictions while the agent interacts with the world. We explore this idea further by analyzing how GVF predictions can be used as predictive features, and introduce two algorithmic techniques to ensure the stability of continual prediction learning. We illustrate these ideas with a small experiment in the cycle world domain.
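To make the GVF idea concrete, the sketch below learns a single GVF prediction with linear TD(0) in a small cycle-world-like domain. This is a minimal illustration under assumed details, not the paper's algorithm: the six-state ring, the observation-bit cumulant, the constant continuation value, and all constants and function names are hypothetical choices made for the example.

```python
import numpy as np

# Hypothetical 6-state cycle world: the agent steps deterministically around a
# ring and observes a bit that is 1 only in state 0. The GVF asks "what is the
# discounted sum of this bit going forward?" and is learned with linear TD(0).
N_STATES = 6
GAMMA = 0.9   # GVF continuation (discount), held constant in this sketch
ALPHA = 0.1   # step size

def features(state):
    """One-hot (tabular) feature vector for a cycle-world state."""
    x = np.zeros(N_STATES)
    x[state] = 1.0
    return x

def cumulant(state):
    """Signal the GVF accumulates: the cycle world's observation bit."""
    return 1.0 if state == 0 else 0.0

w = np.zeros(N_STATES)   # linear weights; the GVF prediction is v(s) = w . x(s)
state = 0
for step in range(10_000):
    next_state = (state + 1) % N_STATES          # deterministic cycle dynamics
    x, x_next = features(state), features(next_state)
    # TD(0): move the prediction toward the cumulant plus the discounted
    # prediction at the next state.
    delta = cumulant(next_state) + GAMMA * (w @ x_next) - (w @ x)
    w += ALPHA * delta * x
    state = next_state

print("learned GVF predictions per state:", np.round(w, 3))
```

The learned vector `w` gives one prediction per state; in the paper's framing, a collection of such predictions, each defined by its own cumulant, continuation, and policy, could serve as the predictive features the agent builds on.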
