This dissertation examines the use of partial programming as a means of designing agents for large Markov Decision Problems. In this approach, a programmer specifies only that which they know to be correct and the system then learns the rest from experience using reinforcement learning.
In contrast to previous low-level languages for partial programming, this dissertation presents ALisp, a Lisp-based high-level partial programming language. ALisp allows the programmer to constrain the policies considered by a learning process and to express their prior knowledge concisely. Optimally completing a partial ALisp program is shown to be equivalent to solving a Semi-Markov Decision Problem (SMDP). Under a finite memory-use condition, online learning algorithms for ALisp are proved to converge to an optimal solution of the SMDP and thus to an optimal completion of the partial program.
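To make the SMDP view concrete, the sketch below shows generic SMDP Q-learning applied at the choice points of a partial program. It is written in Python rather than ALisp, the class and method names are hypothetical, and it is only a minimal illustration: the steps the partial program executes between two choice points are treated as a single temporally extended transition, which is the standard SMDP Q-learning backup.

```python
"""Illustrative sketch (not the ALisp implementation): SMDP Q-learning at the
choice points of a partial program. Between two consecutive choice points the
agent simply follows the partial program; the accumulated discounted reward
and the elapsed time tau drive the update."""
from collections import defaultdict
import random

class ChoicePointLearner:
    def __init__(self, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q = defaultdict(float)          # Q[(omega, u)] for joint state omega, choice u
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, omega, options):
        """Epsilon-greedy completion of the partial program at a choice point."""
        if random.random() < self.epsilon:
            return random.choice(options)
        return max(options, key=lambda u: self.q[(omega, u)])

    def update(self, omega, u, reward, tau, omega_next, next_options):
        """SMDP backup: reward is the discounted return accumulated over the
        tau primitive steps executed between the two choice points."""
        target = reward
        if next_options:                      # terminal program states offer no choices
            target += (self.gamma ** tau) * max(self.q[(omega_next, u2)] for u2 in next_options)
        self.q[(omega, u)] += self.alpha * (target - self.q[(omega, u)])
```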
This dissertation then presents methods for exploiting the modularity of partial programs. State abstraction allows an agent to ignore aspects of its current state that are irrelevant to its current decision, which speeds up reinforcement learning. By decomposing the representation of the value of actions along subroutine boundaries, these methods preserve hierarchical optimality, i.e., optimality among all policies consistent with the partial program. These methods are demonstrated on two simulated taxi tasks.
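The sketch below illustrates the idea of a value estimate decomposed along subroutine boundaries: the overall value of a choice is stored as a sum of components, each of which depends only on the state variables relevant to it. The component names, state features, and abstraction functions are assumptions made for the example, not the dissertation's own decomposition.

```python
"""Illustrative sketch of a value function decomposed along subroutine
boundaries. Each component is learned over its own (smaller) abstraction of
the joint state, so variables irrelevant to that component are ignored; the
component names and abstraction functions here are hypothetical."""
from collections import defaultdict

class DecomposedQ:
    def __init__(self, abstractions):
        # abstractions maps component name -> function reducing the joint state
        # to just the features that component depends on.
        self.abstractions = abstractions
        self.tables = {name: defaultdict(float) for name in abstractions}

    def value(self, omega, u):
        """Q(omega, u) is recovered as the sum of the per-component values."""
        return sum(self.tables[name][(abstract(omega), u)]
                   for name, abstract in self.abstractions.items())

# Example: the in-subroutine component might ignore the passenger's destination,
# while the rest-of-program component ignores the taxi's exact position.
q = DecomposedQ({
    "inside_subroutine": lambda s: (s["taxi_pos"],),
    "rest_of_program":   lambda s: (s["passenger_dest"], s["have_passenger"]),
})
```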
Function approximation, a method for representing the value of actions compactly, allows reinforcement learning to be applied to problems where exact methods are intractable. Soft shaping is a method for guiding an agent toward a solution without constraining the search space. Both can be integrated with ALisp, and ALisp with function approximation and reward shaping is successfully applied to a difficult continuous variant of the simulated taxi task.
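As a minimal sketch of how these two ingredients combine, the snippet below performs a SARSA-style update on a linear approximation of action values with a potential-based shaping bonus of the standard form gamma * phi(s') - phi(s). The feature and potential functions are placeholders supplied by the caller, not those used in the dissertation's taxi experiments.

```python
"""Illustrative sketch combining linear function approximation with a
potential-based shaping reward; the feature and potential functions are
placeholders, not those used in the dissertation's experiments."""
import numpy as np

def shaped_td_update(w, features, potential, s, a, r, s_next, a_next,
                     alpha=0.01, gamma=0.99):
    """One SARSA-style update on a linear Q estimate with a shaping bonus.

    Q(s, a) is approximated as w . features(s, a); the potential-based shaping
    term guides exploration without changing which policies are optimal."""
    shaping = gamma * potential(s_next) - potential(s)
    phi_sa = features(s, a)
    td_error = (r + shaping
                + gamma * np.dot(w, features(s_next, a_next))
                - np.dot(w, phi_sa))
    return w + alpha * td_error * phi_sa
```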
Together, the methods presented in this work comprise a system for agent design that allows the programmer to specify what they know, hint at what they suspect using soft shaping, and leave unspecified what they do not know; the system then optimally completes the program through experience and exploits the hierarchical structure of the partial program to speed learning.