In this report we show how the class of adaptive prediction methods that Sutton called "temporal difference," or TD, methods is related to the theory of sequential decision making. TD methods have been used as "adaptive critics" in connectionist learning systems, and have been proposed as models of animal learning in classical conditioning experiments. Here we relate TD methods to decision tasks formulated in terms of a stochastic dynamical system whose behavior unfolds over time under the influence of a decision maker's actions. Strategies are sought for selecting actions so as to maximize a measure of long-term payoff gain. Mathematically, tasks such as this can be formulated as Markovian decision problems, and numerous methods have been proposed for learning how to solve such problems. We show how a TD method can be understood as a novel synthesis of concepts from the theory of stochastic dynamic programming, which comprises the standard method for solving such tasks when a model of the dynamical system is available, and the theory of parameter estimation, which provides the appropriate context for studying learning rules in the form of equations for updating associative strengths in behavioral models, or connection weights in connectionist networks. Because this report is oriented primarily toward the non-engineer interested in animal learning, it presents tutorials on stochastic sequential decision tasks, stochastic dynamic programming, and parameter estimation.

The authors acknowledge their indebtedness to C. W. Anderson, who has contributed greatly to the development of the ideas presented here. We also thank S. Bradtke, J. E. Desmond, J. Franklin, J. C. Houk, A. I. Houston, and E. J. Kehoe for their helpful comments on earlier drafts of this report, and we especially thank J. W. Moore for his extremely detailed and helpful criticism. A. G. Barto acknowledges the support of the Air Force Office of Scientific Research, Bolling AFB, through grant AFOSR-87-0030, and the King's College Research Centre, King's College, Cambridge, England, where much of this report was written. A version of this report will appear as a chapter in the forthcoming book Learning and Computational Neuroscience, M. Gabriel and J. W. Moore, editors, The MIT Press, Cambridge, MA.
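As a concrete illustration of the kind of learning rule the report discusses, the following is a minimal sketch of tabular TD(0) prediction in Python. The interface, names, and toy two-state chain are our own illustration under generic assumptions, not the report's notation or method; the essential point is that each update moves a prediction toward the reward plus the discounted prediction for the successor state.

```python
# A minimal sketch of a tabular TD(0) prediction update (illustrative only;
# the episode format and parameter names are assumptions, not the report's).

def td0_predict(episodes, gamma=0.9, alpha=0.1):
    """Estimate state values V(s) from (state, reward, next_state) steps."""
    V = {}  # value estimates; unseen states default to 0.0
    for episode in episodes:
        for state, reward, next_state in episode:
            v_s = V.get(state, 0.0)
            v_next = 0.0 if next_state is None else V.get(next_state, 0.0)
            # TD error: discrepancy between successive predictions
            delta = reward + gamma * v_next - v_s
            V[state] = v_s + alpha * delta  # move estimate toward the TD target
    return V

# Tiny two-state chain A -> B -> terminal, with reward 1 on the final step:
# V(B) approaches 1 and V(A) approaches gamma * V(B).
episodes = [[("A", 0.0, "B"), ("B", 1.0, None)] for _ in range(200)]
print(td0_predict(episodes))
```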