General Value Function Networks

State construction is important for learning in partially observable environments. A general-purpose strategy for state construction is to learn the state update with a Recurrent Neural Network (RNN), which updates the internal state using the current internal state and the most recent observation. This internal state summarizes the observed sequence, to facilitate accurate predictions and decision-making. At the same time, RNNs can be hard for non-experts to specify and train. Training RNNs is notoriously tricky, particularly because the common strategy for approximating gradients back in time, truncated Backpropagation Through Time (BPTT), can be sensitive to the truncation window. Further, domain expertise, which can usually help constrain the function class and so improve trainability, can be difficult to incorporate into the complex recurrent units used within RNNs. In this work, we explore multi-step predictions as a simple and general approach to injecting prior knowledge, while retaining much of the generality and learning power of RNNs. In particular, we revisit the idea of using predictions to construct state and ask: does constraining (parts of) the state to consist of predictions about the future improve RNN trainability? We formulate a novel RNN architecture, called a General Value Function Network (GVFN), in which each internal state component corresponds to a prediction about the future, represented as a value function. We first provide an objective for optimizing GVFNs, and derive several algorithms to optimize this objective. We then show that GVFNs are more robust to the truncation level than conventional RNNs, in many cases requiring only one-step gradient updates.
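To make the state update concrete, the following is a minimal NumPy sketch of a GVFN-style recurrent cell, under the assumption of a simple sigmoid recurrence: each hidden unit is trained as a general value function with its own cumulant and continuation, using a one-step semi-gradient TD(0) update (the truncation-level-one regime the abstract refers to). All names here (GVFNCell, cumulants, gammas) are illustrative and do not reflect the paper's actual implementation or notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GVFNCell:
    """Recurrent cell whose i-th state component is trained to be the
    value of a GVF: the expected discounted sum of cumulant c_i under
    continuation gamma_i. A sketch, not the authors' implementation."""

    def __init__(self, n_obs, n_gvfs, cumulants, gammas, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_gvfs, n_gvfs + n_obs))
        self.cumulants = cumulants          # list of functions: obs -> float
        self.gammas = np.asarray(gammas)    # per-GVF continuation in [0, 1)
        self.lr = lr
        self.s = np.zeros(n_gvfs)           # current predictions (the state)
        self._x = None                      # input that produced self.s

    def step(self, obs):
        obs = np.asarray(obs, dtype=float)
        x_new = np.concatenate([self.s, obs])
        s_new = sigmoid(self.W @ x_new)     # next-step predictions

        if self._x is not None:
            # One-step semi-gradient TD(0): the target bootstraps off the
            # new predictions, and the gradient flows only through the
            # previous activation, i.e. a truncation window of one step.
            c = np.array([f(obs) for f in self.cumulants])
            delta = c + self.gammas * s_new - self.s
            dsig = self.s * (1.0 - self.s)  # sigmoid' at the old state
            self.W += self.lr * (delta * dsig)[:, None] * self._x[None, :]

        self._x, self.s = x_new, s_new
        return self.s

# Hypothetical usage: two GVFs over a 1-d observation stream, predicting
# discounted sums of the observation itself at two time scales.
cell = GVFNCell(n_obs=1, n_gvfs=2,
                cumulants=[lambda o: o[0], lambda o: o[0]],
                gammas=[0.5, 0.9])
for t in range(1000):
    state = cell.step([np.sin(0.1 * t)])
```

The design point this sketch tries to illustrate: unlike a vanilla RNN, each hidden unit receives its own local learning signal from a TD target, rather than relying on a downstream loss backpropagated through time, which is one intuition for why robustness to the truncation level might follow.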
