Shaping and policy search in reinforcement learning

To make reinforcement learning algorithms run in a reasonable amount of time, it is frequently necessary to use a well-chosen reward function that gives appropriate “hints” to the learning algorithm. But selecting these hints—called shaping rewards—often entails significant trial and error, and poorly chosen shaping rewards can change the problem in unanticipated ways and cause poor solutions to be learned. In this dissertation, we give a theory of reward shaping that shows how these problems can be eliminated. The theory also gives guidelines for selecting good shaping rewards, which in practice yield significant speedups of the learning process. We further show that shaping can allow us to use “myopic” learning algorithms and still do well.

The “curse of dimensionality” refers to the observation that many simple reinforcement learning algorithms, such as those based on discretization, scale exponentially with the size of the problem and are thus impractical for many applications. In this dissertation, we consider the policy search approach to reinforcement learning, in which we wish to select a controller for a task from some restricted set of controllers. A key issue in policy search is obtaining uniformly good estimates of the quality of the controllers under consideration, and we show that simple Monte Carlo methods will not in general give uniformly good estimates. We then present the PEGASUS policy search method, which is derived from the surprising observation that every reinforcement learning problem can be transformed into one in which all state transitions (given the current state and action) are deterministic. We show that PEGASUS has sample complexity that scales at most polynomially with the size of the problem, and we give strong guarantees on the quality of the solutions it finds. In deriving these results, we also carry the ideas of VC dimension and sample complexity, familiar from supervised learning, over to the reinforcement learning setting, putting the two problems on a more equal footing.

Finally, we apply these ideas to designing a controller for an autonomous helicopter. Autonomous helicopter flight is widely viewed as a difficult control problem. Using shaping and the PEGASUS policy search method, we automatically design a stable hovering controller for the helicopter and make it fly a number of challenging maneuvers taken from an RC helicopter competition.
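
As a concrete illustration of the kind of shaping reward such a theory admits, one well-known form is potential-based shaping; the sketch below uses a state potential Φ and the discount factor γ, and is an illustration rather than necessarily the dissertation's exact formulation. Adding the shaping term F to the original reward R leaves the (near-)optimal policies of the problem unchanged, which is what rules out the unanticipated changes to the problem described above:

    R'(s, a, s') = R(s, a, s') + F(s, a, s'),   where   F(s, a, s') = γΦ(s') − Φ(s).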
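
The deterministic-transitions transformation behind PEGASUS can also be sketched in code. The idea is to make the simulator's random draws explicit inputs and to fix a set of “scenarios” (start states plus noise sequences) once, so every candidate controller is evaluated on the same draws and the value estimate becomes a deterministic function of the policy. The sketch below is a minimal illustration under assumed one-dimensional dynamics; step, reward, and the linear controllers are hypothetical stand-ins, not code from the dissertation.

import numpy as np

def step(state, action, noise):
    # Deterministic simulator: the "random" number is an explicit input,
    # which is the transformation the abstract refers to.  Illustrative
    # 1-D dynamics, a stand-in for a real model.
    return state + action + 0.1 * noise

def reward(state, action):
    # Illustrative quadratic cost: prefer staying near the origin cheaply.
    return -(state ** 2) - 0.01 * (action ** 2)

def evaluate_policy(policy, scenarios, horizon, gamma=0.99):
    # Average discounted return over a FIXED set of scenarios.  Because the
    # scenarios are fixed, the estimate is a deterministic function of the
    # policy, so every candidate policy is scored on the same draws.
    total = 0.0
    for start_state, noise_seq in scenarios:
        state, ret = start_state, 0.0
        for t in range(horizon):
            action = policy(state)
            ret += (gamma ** t) * reward(state, action)
            state = step(state, action, noise_seq[t])
        total += ret
    return total / len(scenarios)

# Draw the scenarios once, up front; reuse them for every policy considered.
rng = np.random.default_rng(0)
horizon = 50
scenarios = [(rng.normal(), rng.normal(size=horizon)) for _ in range(100)]

# Compare two hypothetical linear controllers on the same scenarios.
for gain in (-0.5, -1.0):
    value = evaluate_policy(lambda s, k=gain: k * s, scenarios, horizon)
    print(f"gain {gain}: estimated value {value:.3f}")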