A neural reinforcement learning model for tasks with unknown time delays

Daniel Rasmussen (drasmuss@uwaterloo.ca)
Chris Eliasmith (celiasmith@uwaterloo.ca)
Centre for Theoretical Neuroscience, University of Waterloo
Waterloo, ON, Canada, N2J 3G1

Abstract

We present a biologically based neural model capable of performing reinforcement learning in complex tasks. The model is unique in its ability to solve tasks that require the agent to make a sequence of unrewarded actions in order to reach the goal, in an environment where there are unknown and variable time delays between actions, state transitions, and rewards. Specifically, this is the first neural model of reinforcement learning able to function within a Semi-Markov Decision Process (SMDP) framework. We believe that this extension of current modelling efforts lays the groundwork for increasingly sophisticated models of human decision making.

Keywords: reinforcement learning; neural model; SMDP

Introduction

One of the most successful areas of cross-fertilization between computational modelling and the study of the brain has been the domain of reinforcement learning (RL). This began with the work of Schultz (1998), who demonstrated that the well-defined computational mechanisms of models (e.g., TD reinforcement learning) could provide insight into some of the more opaque mechanisms of the brain (e.g., dopamine signalling).

The models used in that early work were purely algorithmic, with little relation to the biological properties of the brain. However, since that first demonstration many new models have been developed, allowing novel or more detailed comparisons to neural mechanisms: models that more closely reflect the structures of the brain (Frank & Badre, 2012; Stewart et al., 2012), the behaviour of individual neurons (Seung, 2003; Potjans et al., 2009), or synaptic learning mechanisms (Florian, 2007; Baras & Meir, 2007).

In our work we seek to retain the neuroanatomical detail of these models, but expand their functionality; that is, to build models capable of more powerful learning and decision making, enabling them to solve more complex problems. Here we present some first steps in that direction. Specifically, we will discuss the implementation and show early results from a model that is able to solve tasks requiring extended sequences of actions, in environments where there may be unknown and variable time delays between actions and rewards.

Background

Sutton & Barto’s seminal introduction to reinforcement learning illustrates the important challenge for expanding the function of neural RL models: “Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal...In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards (Sutton & Barto, 1998).”

Most existing neural models have performed only associative reinforcement learning, where there is no consideration of future reward (Niv et al., 2002; Seung, 2003; Baras & Meir, 2007; Florian, 2007; Izhikevich, 2007; Frank & Badre, 2012; Stewart et al., 2012). An example of this type of task is bandit learning, where the agent selects one of $n$ available options, receives reward, then is reset back to the choice point. Each trial is independent, so the agent only needs to learn the immediate reward associated with each option, and then pick the best one. This can be expressed in the RL notation as

$$Q(s, a) = r(s, a)$$

where $Q(s, a)$ is the agent’s estimate of the value of taking action $a$ in state $s$, and $r(s, a)$ is the immediate reward received for performing that action in that state. These $Q$ values can be learned by observing $r(s, a)$ and then updating $Q(s, a)$ to bring it closer to the observation. The challenge addressed by many of the models above is how to do that update in a neurally plausible manner.

An example of a more complex reinforcement learning task is a navigation problem, where an agent seeking to reach a goal must choose a direction to move. The agent may receive no immediate reward for making a choice, but there are still good and bad choices (bringing it closer to or farther from the goal). In order to make correct decisions, the agent needs to be able to learn not only the immediate rewards, but the rewards to be expected in the future after taking a given action. This can be expressed as

$$Q(s, a) = r(s, a) + \gamma Q(s', a')$$

In other words, the value of taking action $a$ is equivalent to the immediate reward (as in the previous case), plus the expected value of the action taken in the resulting state (indicating the future reward expected from that state). The future value is discounted by $\gamma < 1$ to indicate that future rewards are valued less than immediate rewards. The $Q$ values can be learned by comparing the predicted value of action $a$ to the observed values upon arriving in state $s'$. This is the temporal difference (TD) learning formula¹:

$$\Delta Q(s, a) = \kappa \left[ r(s, a) + \gamma Q(s', a') - Q(s, a) \right]$$

Most complex problems of the type faced by the brain require the consideration of the future impact of a given action; thus

¹ More specifically, this is the SARSA learning update (Rummery & Niranjan, 1994).
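The two value updates described above can be illustrated with a minimal tabular sketch. This is not the neural model presented in the paper: the lookup table, the function names (`bandit_update`, `sarsa_update`, `run_chain`), the two-step chain task, and the values of the learning rate `kappa` and discount `gamma` are all assumptions chosen for illustration. The incremental form of the bandit update is also an assumption, since the text only says that Q(s, a) is brought closer to the observed reward.

```python
from collections import defaultdict


def bandit_update(q, state, action, reward, kappa=0.1):
    """Associative case: move Q(s, a) a fraction kappa toward r(s, a)."""
    q[(state, action)] += kappa * (reward - q[(state, action)])


def sarsa_update(q, state, action, reward, next_state, next_action,
                 kappa=0.1, gamma=0.9):
    """TD (SARSA) case: dQ(s, a) = kappa * [r + gamma * Q(s', a') - Q(s, a)]."""
    td_error = (reward + gamma * q[(next_state, next_action)]
                - q[(state, action)])
    q[(state, action)] += kappa * td_error
    return td_error


def run_chain(episodes=500, kappa=0.1, gamma=0.9):
    """Hypothetical task: s0 -> s1 -> end, with reward only on the last step.

    The first action is unrewarded, so its value can only be learned
    through the discounted value of the state-action pair that follows it.
    """
    q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
    for _ in range(episodes):
        # Unrewarded step: s0 -> s1
        sarsa_update(q, "s0", "go", 0.0, "s1", "go", kappa, gamma)
        # Rewarded step: s1 -> end (the terminal value is never updated)
        sarsa_update(q, "s1", "go", 1.0, "end", "go", kappa, gamma)
    return q


if __name__ == "__main__":
    q = run_chain()
    print(q[("s1", "go")], q[("s0", "go")])
```

In this sketch the value of the rewarded step approaches the reward itself, while the value of the unrewarded first step approaches the discounted value of the step after it, which is the behaviour the TD formula is meant to capture.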
[1] Andrew G. Barto, et al. Reinforcement learning. 1998.
[2] Ronald A. Howard, et al. Dynamic Probabilistic Systems. 1971.
[3] Chris Eliasmith, et al. Neural Engineering: Computation, Representation, and Dynamics in Neurobiological Systems. IEEE Transactions on Neural Networks, 2004.
[4] Sridhar Mahadevan, et al. Recent Advances in Hierarchical Reinforcement Learning. Discret. Event Dyn. Syst., 2003.
[5] C. Eliasmith, et al. Dynamic Behaviour of a Spiking Model of Action Selection in the Basal Ganglia Neural Structure. 2010.
[6] Richard S. Sutton, et al. Dimensions of Reinforcement Learning. 1998.
[7] C. Eliasmith, et al. Learning to Select Actions with Spiking Neurons in the Basal Ganglia. Front. Neurosci., 2012.
[8] Mahesan Niranjan, et al. On-line Q-learning using connectionist systems. 1994.
[9] David J. Foster, et al. A model of hippocampally dependent navigation, using the temporal difference learning rule. Hippocampus, 2000.
[10] Chris Eliasmith, et al. Fine-Tuning and the Stability of Recurrent Neural Networks. PloS one, 2011.
[11] Chris Eliasmith, et al. A Unified Approach to Building and Controlling Spiking Attractor Networks. Neural Computation, 2005.
[12] Kae Nakamura, et al. Predictive Reward Signal of Dopamine Neurons. 2015.
[13] H. Seung, et al. Learning in Spiking Neural Networks by Reinforcement of Stochastic Synaptic Transmission. Neuron, 2003.
[14] E. Izhikevich. Solving the distal reward problem through linkage of STDP and dopamine signaling. BMC Neuroscience, 2007.
[15] Markus Diesmann, et al. A Spiking Neural Network Model of an Actor-Critic Learning Agent. Neural Computation, 2009.
[16] Razvan V. Florian, et al. Reinforcement Learning Through Modulation of Spike-Timing-Dependent Synaptic Plasticity. Neural Computation, 2007.
[17] Ron Meir, et al. Reinforcement Learning, Spike-Time-Dependent Plasticity, and the BCM Rule. Neural Computation, 2007.
[18] Andrew W. Moore, et al. Reinforcement Learning: A Survey. J. Artif. Intell. Res., 1996.
[19] Barry D. Nichols. Reinforcement learning in continuous state- and action-space. 2014.
[20] Michael O. Duff, et al. Reinforcement Learning Methods for Continuous-Time Markov Decision Problems. NIPS, 1994.
[21] M. Frank, et al. Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 1: computational analysis. Cerebral cortex, 2012.
[22] Y. Niv, et al. Evolution of Reinforcement Learning in Uncertain Environments: A Simple Explanation for Complex Foraging Behaviors. 2002.
[23] Doina Precup, et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artif. Intell., 1999.
[24] Trevor Bekolay, et al. A Large-Scale Model of the Functioning Brain. Science, 2012.