From recurrent choice to skill learning: a reinforcement-learning model.

The authors propose a reinforcement-learning mechanism as a model of recurrent choice and extend it to account for skill learning. The mechanism was inspired by recent neurophysiological studies of the basal ganglia and provides an integrated explanation of recurrent-choice behavior and skill learning. The recurrent-choice phenomena it captures include the effects of differential probability, magnitude, variability, and delay of reinforcement. The model also produces violations of independence, preference reversals, and the goal gradient of reinforcement in maze learning. An experiment was conducted to study the learning of action sequences in a multistep task. The fit of the model to the data demonstrates its ability to account for complex skill learning. The advantages of incorporating the mechanism into a larger cognitive architecture are also discussed.
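The abstract names only the broad ingredients of the mechanism: a basal-ganglia-inspired, reward-prediction-driven learning rule applied both to repeated choices and to multistep action sequences. As a rough illustration of that class of model, the following minimal sketch uses a standard temporal-difference (Q-learning) update with softmax action selection on a hypothetical multistep task. The learning rate, discount, temperature, and task structure are assumptions made for illustration, not the authors' actual parameterization or architecture.

```python
import math
import random

ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.9   # discount rate (assumed value)
TAU = 0.25    # softmax temperature (assumed value)


def softmax(qvals, tau):
    """Choose an action index with probability proportional to exp(q / tau)."""
    weights = [math.exp(q / tau) for q in qvals]
    r = random.random() * sum(weights)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r <= cumulative:
            return i
    return len(weights) - 1


def run(steps=5, episodes=3000):
    # Q[state][action]: state = position in the sequence,
    # actions = {0: correct step, 1: error}. Hypothetical task structure.
    q = [[0.0, 0.0] for _ in range(steps)]
    for _ in range(episodes):
        state = 0
        while state < steps:
            action = softmax(q[state], TAU)
            if action == 0:
                # A correct step advances toward the goal; reward only at the end.
                next_state = state + 1
                reward = 1.0 if next_state == steps else 0.0
            else:
                # An error ends the attempt unrewarded.
                next_state = steps
                reward = 0.0
            future = 0.0 if next_state == steps else max(q[next_state])
            # Temporal-difference (Q-learning) error-correcting update.
            q[state][action] += ALPHA * (reward + GAMMA * future - q[state][action])
            state = next_state
    return q


if __name__ == "__main__":
    q = run()
    # Learned value of the correct action at each position in the sequence.
    print([round(row[0], 2) for row in q])
```

Under these assumptions, the learned value of the correct action grows as the agent nears the reward, a pattern analogous to the goal gradient mentioned in the abstract, and the softmax rule yields graded preference for more frequently or more immediately reinforced options rather than all-or-none choice.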
