Actions, Policies, Values, and the Basal Ganglia

The basal ganglia are widely believed to be involved in the learned selection of actions. Building on this idea, reinforcement learning (RL) theories of optimal control have had some success in explaining the responses of their key dopaminergic afferents. While these model-free RL theories offer a compelling account of a range of neurophysiological and behavioural data, they give only an incomplete picture of action control in the brain. Psychologists and behavioural neuroscientists have long appealed to the existence of at least two separate systems underlying the learned control of behaviour. The habit system is closely identified with the basal ganglia, and we associate it with the model-free RL theories. The other system, more loosely localised in prefrontal regions and lacking such a detailed theoretical account, is associated with cognitively more sophisticated goal-directed actions. On the critical issue of which system determines the ultimate output when they disagree, there is a wide range of experimental results but only sparse theoretical underpinning. Here, we extend the RL account of neural action control by first interpreting goal-directed actions in terms of an alternative model-based strategy for RL. Then, by considering the relative uncertainties of model-free and model-based controllers, we offer a new and more comprehensive account of the confusing experimental results about how the systems trade off control. Our theory offers a more sharply delineated view of the contributions of the basal ganglia to learned behavioural control.
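As a rough illustration of what trading off control by relative uncertainty could look like, the sketch below lets each controller report a value estimate and a variance per action, then trusts whichever estimate is less uncertain. The abstract does not state the arbitration rule, so the lower-variance rule, the `Estimate` type, the `arbitrate` function, and all numbers are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    """A controller's value estimate for one action (hypothetical type)."""
    mean: float      # expected value of taking the action
    variance: float  # the controller's uncertainty about that value


def arbitrate(model_free, model_based):
    """Pick an action using, per action, the less uncertain controller.

    Both arguments map action names to Estimate objects. Returns a tuple
    (best_action, controller_name). The lower-variance rule is an
    assumption, not a rule taken from the paper.
    """
    chosen = {}
    for action in model_free:
        mf, mb = model_free[action], model_based[action]
        if mf.variance <= mb.variance:
            chosen[action] = ("model_free", mf.mean)
        else:
            chosen[action] = ("model_based", mb.mean)
    # Act greedily on the selected value estimates.
    best = max(chosen, key=lambda a: chosen[a][1])
    return best, chosen[best][0]


# Hypothetical numbers: after extensive training the model-free (habit)
# estimates are confident and still favour the devalued "press" action.
habit_like = arbitrate(
    {"press": Estimate(1.0, 0.1), "pull": Estimate(0.2, 0.1)},
    {"press": Estimate(0.0, 0.5), "pull": Estimate(0.2, 0.5)},
)

# Early in training the model-free estimates are noisy, so the
# model-based (goal-directed) values control behaviour.
goal_directed = arbitrate(
    {"press": Estimate(1.0, 0.9), "pull": Estimate(0.2, 0.9)},
    {"press": Estimate(0.0, 0.3), "pull": Estimate(0.2, 0.3)},
)
```

In this toy setting, the same disagreement about "press" is resolved in favour of the habit when the model-free variance is low and in favour of the goal-directed values when it is high.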
