Extending the Computational Abilities of the Procedural Learning Mechanism in ACT-R

Wai-Tat Fu (wfu@cmu.edu)
John R. Anderson (ja+@cmu.edu)
Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract

The existing procedural learning mechanism in ACT-R (Anderson & Lebiere, 1998) has been successful in explaining a wide range of adaptive choice behavior. However, the existing mechanism is inherently limited to learning from binary feedback (i.e., whether a reward is received or not). It is thus difficult to capture choice behavior that is sensitive to both the probability of receiving a reward and the reward magnitude. By modifying the temporal difference learning algorithm (Sutton & Barto, 1998), a new procedural learning mechanism is implemented that generalizes and extends the computational abilities of the current mechanism. Models using the new mechanism were fit to three sets of human data collected from probability learning and decision making experiments. The new procedural learning mechanism fits the data at least as well as the existing mechanism, and is able to fit data that are problematic for the existing mechanism. This paper also shows how the principle of reinforcement learning can be implemented in a production system like ACT-R.

Introduction

Human choice behavior is often studied in probability learning situations. In a typical probability learning situation, participants are asked to select one of many available options, and feedback on whether the choice is correct is given after the selection. There are usually two main manipulations in a probability learning task: (1) the probability of each option being correct, and (2) the magnitude of the reward (usually monetary) received when the correct option is selected. One robust result is that people tend to choose each option a proportion of the time equal to its probability of being correct, a phenomenon often called "probability matching" (e.g., Friedman et al., 1964). However, when the reward magnitudes are varied, the observed choice probabilities are sometimes larger or smaller than the outcome probabilities (e.g., Myers, Fort, Katz, & Suydam, 1963). These studies show consistently that people are sensitive to both outcome probabilities and reward magnitudes in making choices.

One limitation of the current ACT-R procedural learning mechanism (Lovett, 1998) is that it requires a pre-specification of correct and incorrect responses. In addition, the feedback received is limited to a binary function (i.e., whether a reward is received or not). A simple binary function may not be sufficient to represent the feedback from the environment. For example, imagine a situation in which there are several possible treatments for a particular disease and a physician has to choose the treatment with the highest expected effectiveness. One may have to evaluate the effectiveness of each treatment through case-by-case feedback. For example, consider the case where the probabilities of effectiveness of three treatments 1, 2, and 3 are as shown in Figure 1. Since the effectiveness of each treatment follows a continuous distribution, a simple binary feedback function is clearly insufficient to represent the information received from the feedback.

Figure 1. Probability of effectiveness of three treatments. (The figure plots P(Effectiveness) against effectiveness for Treatments 1, 2, and 3.)

Another motivation for extending the current mechanism comes from recent findings on the functional role of dopaminergic signals in the basal ganglia during procedural learning. Research shows that learning is driven by the deviation between the expected and actual reward (Schultz et al., 1995; Schultz, Dayan, & Montague, 1997). In other words, the reward magnitude is processed as a scalar quantity: depending on whether the magnitude of the actual reward is higher or lower than expected, a positive or negative reinforcement signal is generated. The pre-specification of correct and incorrect responses is therefore inconsistent with the current understanding of the procedural learning mechanism in the basal ganglia.
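To make this scalar reinforcement signal concrete, the sketch below is an illustration in the spirit of the temporal difference rule of Sutton and Barto (1998), not the mechanism developed in this paper: it tracks an expected reward and nudges it by the signed difference between the actual and expected reward. The learning rate and the simulated stream of continuously distributed rewards are arbitrary assumptions.

import random

def run_prediction_error_learning(rewards, alpha=0.1):
    """Track an expected reward and update it by the prediction error on each trial."""
    v = 0.0                       # current expectation of reward
    for r in rewards:             # r is a scalar reward magnitude, not a binary hit/miss
        delta = r - v             # prediction error: positive if better than expected
        v += alpha * delta        # move the expectation toward the observed reward
        yield delta, v

# Example: continuously distributed "effectiveness" feedback, loosely like Figure 1.
random.seed(0)
rewards = [random.gauss(mu=5.0, sigma=1.0) for _ in range(20)]
for trial, (delta, v) in enumerate(run_prediction_error_learning(rewards), start=1):
    print(f"trial {trial:2d}  prediction error {delta:+.2f}  expectation {v:.2f}")

Under this kind of rule the sign of the prediction error plays the role of the positive or negative reinforcement signal, without any pre-specification of which responses count as correct.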
The ACT-R 5.0 architecture

Figure 2 shows the basic architecture of the ACT-R 5.0 system. The core of the system is a set of production rules that represents procedural memory. Production rules coordinate actions in each of the separate modules. The modules communicate with each other through their buffers, which hold the information necessary for the interaction between the system and the external world. Anderson, Qin, Sohn, Stenger, and Carter (2003) showed that the activity in these buffers matches well the activity in certain cortical areas (see Figure 2). The basal ganglia are hypothesized to implement production rules in ACT-R, which match and act on patterns of activity in the buffers. This is consistent with a typical ACT-R cycle, in which production rules are matched to the pattern of activity in the buffers, a production is selected and fired, and the contents of the buffers are updated.

In ACT-R, when more than one production matches the pattern of buffer activity, the system selects a production based on a conflict resolution mechanism. The basis of the conflict resolution mechanism is the computation of expected utility, which captures the probability that the production will lead to the goal and the cost of achieving it.
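As a rough illustration of this conflict resolution scheme, the sketch below computes the standard ACT-R expected utility E = PG - C (Anderson & Lebiere, 1998) for a set of hypothetical productions and selects among them after perturbing each utility with transient noise. The production set, the goal value, and the Gaussian noise are simplifying assumptions made for the example; ACT-R itself draws utility noise from a logistic distribution.

import random

def expected_utility(p, g, c):
    """Expected utility of a production: success probability * goal value - cost."""
    return p * g - c

def select_production(productions, goal_value=20.0, noise_sd=1.0):
    """Pick the matching production with the highest noise-perturbed utility."""
    def noisy_utility(prod):
        name, p, cost = prod
        return expected_utility(p, goal_value, cost) + random.gauss(0.0, noise_sd)
    return max(productions, key=noisy_utility)[0]

# Three hypothetical competing productions: (name, estimated P of success, estimated cost).
productions = [("choose-A", 0.7, 2.0), ("choose-B", 0.5, 1.0), ("choose-C", 0.3, 0.5)]
choices = [select_production(productions) for _ in range(1000)]
print({name: choices.count(name) for name, _, _ in productions})

Because the noise makes selection probabilistic, repeated runs yield choice proportions across the competing productions, which is the kind of choice data to which such models are fit.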
References

[1] Michael J. Frank, et al. Making Working Memory Work: A Computational Model of Learning in the Prefrontal Cortex and Basal Ganglia. Neural Computation, 2006.
[2] Richard S. Sutton, et al. Reinforcement Learning: An Introduction. IEEE Trans. Neural Networks, 1998.
[3] Mike Fitzpatrick. Choice. The Lancet, 2004.
[4] John R. Anderson, et al. An information-processing model of the BOLD response in symbol manipulation tasks. Psychonomic Bulletin & Review, 2003.
[5] Clay B. Holroyd, et al. The neural basis of human error processing: reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 2002.
[6] J. E. Mazur, et al. Hyperbolic value addition and general models of animal choice. Psychological Review, 2001.
[7] D. Geary, et al. Psychonomic Bulletin & Review, 2000.
[8] C. Lebiere, et al. The Atomic Components of Thought. 1998.
[9] M. Botvinick, et al. Anterior cingulate cortex, error detection, and the online monitoring of performance. Science, 1998.
[10] Peter Dayan, et al. A Neural Substrate of Prediction and Reward. Science, 1997.
[11] A. Dickinson, et al. Reward-related signals carried by dopamine neurons. 1995.
[12] Joel L. Davis, et al. Adaptive Critics and the Basal Ganglia. 1995.
[13] W. Schultz, et al. Importance of unpredictability for reward responses in primate dopamine neurons. Journal of Neurophysiology, 1994.
[14] Joel L. Davis, et al. A Model of How the Basal Ganglia Generate and Use Neural Signals That Predict Reinforcement. 1994.
[15] John R. Anderson, et al. Rules of the Mind. 1993.
[16] W. Schultz, et al. Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. The Journal of Neuroscience, 1993.
[17] Jerome R. Busemeyer, et al. An adaptive approach to human decision making: Learning theory, decision theory, and human performance. 1992.
[18] W. Schultz, et al. Responses of monkey dopamine neurons during learning of behavioral reactions. Journal of Neurophysiology, 1992.
[19] D. Prelec, et al. Negative Time Preference. 1991.
[20] A. G. Barto, et al. Toward a modern theory of adaptive networks: expectation and prediction. Psychological Review, 1981.
[21] R. Rescorla. A theory of Pavlovian conditioning: The effectiveness of reinforcement and non-reinforcement. 1972.
[22] W. F. Prokasy, et al. Classical conditioning II: Current research and theory. 1972.
[23] J. L. Myers, et al. Differential monetary gains and losses and event probability in a two-choice situation. Journal of Experimental Psychology, 1963.