Learning and Sequential Decision Making

In this report we show how the class of adaptive prediction methods that Sutton called "temporal difference," or TD, methods are related to the theory of sequential decision making. TD methods have been used as "adaptive critics" in connectionist learning systems, and have been proposed as models of animal learning in classical conditioning experiments. Here we relate TD methods to decision tasks formulated in terms of a stochastic dynamical system whose behavior unfolds over time under the influence of a decision maker's actions. Strategies are sought for selecting actions so as to maximize a measure of long-term payoff gain. Mathematically, tasks such as this can be formulated as Markovian decision problems, and numerous methods have been proposed for learning how to solve such problems. We show how a TD method can be understood as a novel synthesis of concepts from the theory of stochastic dynamic programming, which comprises the standard method for solving such tasks when a model of the dynamical system is available, and the theory of parameter estimation, which provides the appropriate context for studying learning rules in the form of equations for updating associative strengths in behavioral models, or connection weights in connectionist networks. Because this report is oriented primarily toward the non-engineer interested in animal learning, it presents tutorials on stochastic sequential decision tasks, stochastic dynamic programming, and parameter estimation.
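To make the prediction problem concrete, the following sketch applies a TD(0)-style update to estimate long-term discounted payoff on a small Markov chain. The chain, its payoffs, the discount factor, and the learning rate are all illustrative assumptions, not specifics from the report:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical 3-state Markov chain: each state maps to a list of
# (probability, next_state) pairs. Payoffs are received on entering a state.
transitions = {
    0: [(0.7, 1), (0.3, 2)],
    1: [(0.5, 0), (0.5, 2)],
    2: [(1.0, 0)],
}
payoff = {0: 0.0, 1: 1.0, 2: -1.0}
gamma = 0.9    # discount factor on future payoffs
alpha = 0.05   # learning rate

def step(state):
    """Sample the next state from the chain's transition probabilities."""
    r, cum = random.random(), 0.0
    for p, nxt in transitions[state]:
        cum += p
        if r < cum:
            return nxt
    return transitions[state][-1][1]

V = {s: 0.0 for s in transitions}  # predicted long-term payoff per state
state = 0
for _ in range(20000):
    nxt = step(state)
    # TD error: payoff plus the discrepancy between successive predictions
    td_error = payoff[nxt] + gamma * V[nxt] - V[state]
    V[state] += alpha * td_error
    state = nxt
```

After many transitions, `V` approaches the solution of the prediction equations that stochastic dynamic programming would solve directly if the transition probabilities were given; here they are only sampled, which is the sense in which the update is a learning rule.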