Reinforcement Learning Models Then-and-Now: From Single Cells to Modern Neuroimaging

Although largely ignored in some intellectual circles today, behaviorism and its models from the early and middle twentieth century provided the basis for some of the first computational accounts of reward learning. The best expression of this work emerged in the early 1970s with the Rescorla–Wagner model of Pavlovian conditioning, which accounted for a range of behavioral data on how animals learn about cues that predict rewarding outcomes. The step forward in this account was that learning was depicted as being driven by failed predictions: some system collected information, formed expectations about how much reward to expect from each cue (the “conditioned stimuli,” or CSs), and generated learning updates proportional to the size and sign of the prediction error. While successful in describing a large body of data, the Rescorla–Wagner model failed at one critical aspect of simple learning: the capacity to “chain” important cues together into a trajectory of learned associations, as in “A predicts B predicts food.” This feature is called secondary conditioning.
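
The error-driven update described above can be illustrated with a minimal Python sketch. It assumes the standard trial-level form of the Rescorla–Wagner rule, in which each cue present on a trial has its value changed by a learning rate times the difference between the obtained reward and the summed prediction. The function name, the learning rate `alpha`, and the way trials are encoded are illustrative assumptions, not details taken from the text.

```python
def rescorla_wagner(trials, alpha=0.1, values=None):
    """Apply a trial-level Rescorla-Wagner update (illustrative sketch).

    trials: iterable of (cues, reward) pairs, where cues is a set of cue
            names present on that trial and reward is the outcome.
    alpha:  learning rate (an assumed value, chosen for illustration).
    values: optional dict of starting associative strengths.
    """
    values = dict(values or {})
    for cues, reward in trials:
        # Prediction is the summed value of all cues present on the trial.
        prediction = sum(values.get(c, 0.0) for c in cues)
        # The signed prediction error drives learning.
        error = reward - prediction
        for c in cues:
            values[c] = values.get(c, 0.0) + alpha * error
    return values


# Phase 1: cue B is repeatedly paired with food (reward = 1).
values = rescorla_wagner([({"B"}, 1.0)] * 100)

# Phase 2: cue A is followed by B but never directly by food. One simple
# (assumed) encoding is that A's own trial carries reward 0.
values = rescorla_wagner([({"A"}, 0.0)] * 100, values=values)

print(values)
```

Under this encoding B ends up with a value close to 1 while A stays at 0, even though A reliably predicts B. Because the rule only compares predictions with the reward actually delivered on a trial, the value B has acquired is never passed back to A, which is the chaining failure the paragraph describes.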
