Which Temporal Difference Learning Algorithm Best Reproduces Dopamine Activity in a Multi-choice Task?

The activity of dopaminergic (DA) neurons has been hypothesized to encode a reward prediction error (RPE) which corresponds to the error signal in Temporal Difference (TD) learning algorithms. This hypothesis has been reinforced by numerous studies showing the relevance of TD learning algorithms to describe the role of basal ganglia in classical conditioning. However, recent recordings of DA neurons during multi-choice tasks raised contradictory interpretations on whether DA’s RPE signal is action dependent or not. Thus the precise TD algorithm (i.e. Actor-Critic, Q-learning or SARSA) that best describes DA signals remains unknown. Here we simulate and precisely analyze these TD algorithms on a multi-choice task performed by rats. We find that DA activity previously reported in this task is best fitted by a TD error which has not fully converged, and which converged faster than observed behavioral adaptation.

[1]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[2]  O. Hikosaka,et al.  Two types of dopamine neuron distinctly convey positive and negative motivational signals , 2009, Nature.

[3]  P. Dayan,et al.  Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control , 2005, Nature Neuroscience.

[4]  W. Schultz Predictive reward signal of dopamine neurons. , 1998, Journal of neurophysiology.

[5]  Y. Niv,et al.  Dialogues on prediction errors , 2008, Trends in Cognitive Sciences.

[6]  Peter Dayan,et al.  A Neural Substrate of Prediction and Reward , 1997, Science.

[7]  E. Vaadia,et al.  Midbrain dopamine neurons encode decisions for future action , 2006, Nature Neuroscience.

[8]  P. Glimcher,et al.  Midbrain Dopamine Neurons Encode a Quantitative Reward Prediction Error Signal , 2005, Neuron.

[9]  Saori C. Tanaka,et al.  Prediction of immediate and future rewards differentially recruits cortico-basal ganglia loops , 2004, Nature Neuroscience.

[10]  J. Hollerman,et al.  Dopamine neurons report an error in the temporal prediction of reward during learning , 1998, Nature Neuroscience.

[11]  M. Roesch,et al.  Dopamine neurons encode the better option in rats deciding between differently delayed or sized rewards , 2007, Nature Neuroscience.

[12]  Amir Dezfouli,et al.  Speed/Accuracy Trade-Off between the Habitual and the Goal-Directed Processes , 2011, PLoS Comput. Biol..

[13]  N. Daw Dopamine: at the intersection of reward and action , 2007, Nature Neuroscience.