Scale Invariant Value Computation for Reinforcement Learning in Continuous Time

Natural learners must compute an estimate of future outcomes that follow from a stimulus in continuous time. Critically, the learner cannot in general know a priori the relevant time scale over which meaningful relationships will be observed. Widely used reinforcement learning algorithms discretize continuous time and use the Bellman equation to estimate exponentially-discounted future reward. However, exponential discounting introduces a time scale into the computation of value, implying that the relative values of different states depend on how time is discretized. This is a serious problem in continuous time, as successful learning requires prior knowledge of the solution. We discuss a recent computational hypothesis, developed on the basis of work in psychology and neuroscience, for computing a scale-invariant timeline of future events. This hypothesis efficiently computes a model for future time on a logarithmically-compressed scale. Here we show that this model for future prediction can be used to generate a scale-invariant, power-law-discounted estimate of expected future reward. The scale-invariant timeline could provide the centerpiece of a neurocognitive framework for reinforcement learning in continuous time.

Introduction

In reinforcement learning, an agent learns to optimize its actions by interacting with the environment, aiming to maximize temporally-discounted future reward. In order to navigate the environment, the agent perceives stimuli that define different states. The stimuli are experienced embedded in continuous time, with temporal relationships that the agent must learn in order to acquire the optimal action policy. Temporal discounting is well justified by numerous behavioral experiments on humans and animals (see, e.g., Kurth-Nelson, Bickel, and Redish (2012)) and is useful in many practical applications (see, e.g., Mnih et al. (2015)). If the value of a state is defined as expected future reward discounted with an exponential function of future time, value can be updated in a recursive fashion, following the Bellman equation (Bellman, 1957). The Bellman equation is a foundation of highly successful and widely used modern reinforcement learning approaches such as dynamic programming and temporal difference (TD) learning (Sutton and Barto, 1998).

Exponential temporal discounting is not scale-invariant

When using the Bellman equation (or exponential discounting in general), the values assigned to states depend on the chosen discretization of the temporal axis in a non-linear fashion. Consequently, the ratio of the values attributed to different states changes as a function of the chosen temporal resolution and the base of the exponential function. To illustrate this, let us define the value of a state s observed at time t as a sum of expected rewards r discounted with an exponential function:
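In conventional notation (a sketch only; the paper's exact symbols and indexing may differ), with the temporal axis discretized into steps of width $\Delta t$ and a per-step discount factor $0 < \gamma < 1$, this definition takes the form

$$ V(s_t) \;=\; \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t + k\Delta t}\right], $$

so a reward arriving $\tau$ time units in the future is discounted by $\gamma^{\tau/\Delta t}$, and rescaling $\Delta t$ changes the relative values of states.

A minimal numerical sketch of this scale dependence (not taken from the paper; the delays of 2 and 10 time units, $\gamma = 0.9$, and the unit power-law exponent are illustrative assumptions) compares the ratio of discounted values of two future rewards under exponential versus power-law discounting as the discretization step shrinks:

```python
def exp_discounted_value(delay, dt, gamma=0.9):
    """Value of a unit reward arriving `delay` time units in the future under
    exponential discounting with factor `gamma` per discrete step of width `dt`."""
    n_steps = delay / dt
    return gamma ** n_steps


def power_discounted_value(delay, dt, exponent=1.0):
    """Same quantity under power-law discounting, ~ 1 / (delay in steps)**exponent.
    A generic stand-in for scale-invariant discounting, not the paper's
    Laplace-transform timeline mechanism."""
    n_steps = delay / dt
    return 1.0 / (n_steps ** exponent)


# Two states whose rewards arrive 2 and 10 time units in the future.
for dt in (1.0, 0.1, 0.01):
    exp_ratio = exp_discounted_value(2.0, dt) / exp_discounted_value(10.0, dt)
    pow_ratio = power_discounted_value(2.0, dt) / power_discounted_value(10.0, dt)
    print(f"dt={dt:<5} exponential ratio = {exp_ratio:.3g}   power-law ratio = {pow_ratio:.3g}")
```

With a fixed per-step $\gamma$, the exponential ratio equals $\gamma^{-8/\Delta t}$ and grows without bound as $\Delta t$ shrinks (roughly 2.3, 4.6e3, and 4e36 for the three step sizes above), whereas the power-law ratio stays at 5 for every $\Delta t$. Keeping relative values fixed under exponential discounting would require rescaling $\gamma$ together with $\Delta t$, i.e., knowing the relevant time scale in advance, which is exactly the prior knowledge a learner in continuous time does not have.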

References

[1] Volodymyr Mnih, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[2] Marc W. Howard, et al. A simple biophysically plausible model for long time constants in single neurons, 2015, Hippocampus.

[3] Marc W. Howard, et al. A Scale-Invariant Internal Representation of Time, 2012, Neural Computation.

[4] Richard S. Sutton, et al. TD Models: Modeling the World at a Mixture of Time Scales, 1995, ICML.

[5] P. Dayan. The Convergence of TD(λ) for General λ, 1992, Machine Learning.

[6] Richard S. Sutton, et al. Stimulus Representation and the Timing of Reward-Prediction Errors in Models of the Dopamine System, 2008, Neural Computation.

[7] Marc W. Howard, et al. Sequential firing codes for time in rodent mPFC, 2015.

[8] R. Bellman. A Markovian Decision Process, 1957.

[9] Z. Kurth-Nelson, et al. A theoretical account of cognitive effects in delay discounting, 2012, The European Journal of Neuroscience.

[10] Qian Du, et al. A Unified Mathematical Framework for Coding Time, Space, and Sequences in the Hippocampal Region, 2014, The Journal of Neuroscience.

[11] William H. Alexander, et al. Hyperbolically Discounted Temporal Difference Learning, 2010, Neural Computation.

[12] H. Eichenbaum, et al. Hippocampal “Time Cells” Bridge the Gap in Memory for Discontiguous Events, 2011, Neuron.

[13] Zoran Tiganj, et al. Encoding the Laplace transform of stimulus history using mechanisms for persistent firing, 2013, BMC Neuroscience.

[14] Marc W. Howard, et al. Neural Mechanism to Simulate a Scale-Invariant Future, 2015, Neural Computation.

[15] Marc W. Howard, et al. Optimally fuzzy scale-free memory, 2012, arXiv.

[16] Richard S. Sutton. Learning to predict by the methods of temporal differences, 1988, Machine Learning.

[17] Marc W. Howard, et al. A distributed representation of temporal context, 2002, Journal of Mathematical Psychology.

[18] Joseph J. Paton, et al. A Scalable Population Code for Time in the Striatum, 2015, Current Biology.

[19] Elliot A. Ludvig, et al. Evaluating the TD model of classical conditioning, 2012, Learning & Behavior.

[20] Marc W. Howard, et al. Time Cells in Hippocampal Area CA3, 2016, The Journal of Neuroscience.

[21] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, MIT Press.