Optimizing agent behavior over long time scales by transporting value

Humans prolifically engage in mental time travel. We dwell on past actions and experience satisfaction or regret. More than storytelling, these recollections change how we act in the future and endow us with a computationally important ability to link actions and consequences across spans of time, which helps address the problem of long-term credit assignment: the question of how to evaluate the utility of actions within a long-duration behavioral sequence. Existing approaches to credit assignment in AI cannot solve tasks with long delays between actions and consequences. Here, we introduce a paradigm where agents use recall of specific memories to credit past actions, allowing them to solve problems that are intractable for existing algorithms. This paradigm broadens the scope of problems that can be investigated in AI and offers a mechanistic account of behaviors that may inspire models in neuroscience, psychology, and behavioral economics.

People are able to mentally time travel to distant memories and reflect on the consequences of those past events. Here, the authors show how a mechanism that connects learning from delayed rewards with memory retrieval can enable AI agents to discover links between past events to help decide better courses of action in the future.
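The core idea, crediting a past action with value realized when that moment is later recalled, can be illustrated with a minimal toy sketch. This is an assumption-laden simplification, not the paper's implementation: all names here are illustrative, and the "memory reads" are given as explicit (reader step, remembered step) pairs rather than produced by a learned retrieval mechanism.

```python
def transported_returns(rewards, values, reads, gamma=1.0, alpha=1.0):
    """Toy memory-based credit assignment (illustrative only).

    rewards[t] : reward received at step t
    values[t]  : agent's value estimate at step t
    reads      : list of (t_read, t_src) pairs, meaning step t_read
                 retrieved the memory of earlier step t_src
    Returns per-step returns in which the value seen at a retrieval step
    is injected as a fictitious reward at the remembered step.
    """
    T = len(rewards)
    bonus = [0.0] * T
    for t_read, t_src in reads:
        # Transport value backward: the remembered step is credited with
        # the value estimated at the moment it was recalled.
        bonus[t_src] += alpha * values[t_read]
    # Standard discounted return computed over rewards plus transported value.
    returns = [0.0] * T
    g = 0.0
    for t in reversed(range(T)):
        g = rewards[t] + bonus[t] + gamma * g
        returns[t] = g
    return returns
```

For example, if step 2 recalls step 0 while holding value estimate 0.5, the return at step 0 rises by 0.5, so an otherwise long-delayed consequence influences learning at the remembered step without relying on discounting across the whole gap.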
