Representations for learning control policies

Representing the expected reward or cost for taking an action in a stochastic control problem, such as automated driving, is not trivial when the state and action spaces are continuous. Simple techniques can suffer from forgetting, where the lessons learned while the agent was doing poorly (e.g. how to recover when the car is headed off the road) are lost once the agent has learned a policy that works in the majority of cases. This paper presents and demonstrates a method that effectively learns and maintains a Q-function approximation using stored instances of past observations. This instance-based reinforcement learning algorithm includes the extensions and optimizations required to learn in complex control domains. To make further use of the stored examples, our method learns a model of the environment and uses that model to improve the estimate of the value of taking actions in states. We explore several techniques for choosing how to use the model to efficiently improve the value function, and present an original algorithm based on generalized prioritized sweeping that outperforms the others on two example driving tasks.

In our experience, the task of learning to control an autonomous vehicle is best formulated as a stochastic optimal control problem. Reinforcement learning (RL) algorithms can learn optimal behavior for such problems from trial-and-error interactions with the environment. However, reinforcement learning algorithms are often unable to effectively learn policies for domains with certain properties: continuous state and action spaces, a need for real-time online operation, and continuous long-term operation.

Driving is a particularly challenging problem, since the task itself changes over time. A lane-following agent may become proficient at negotiating curved roads and then drive a long straight stretch, where it becomes even more proficient on straight roads. It should not, however, lose proficiency on curved roads. The overall goal of staying in the center of the lane remains the same, but the kinds of states the agent faces change when moving from curved roads to straight and back again. Many learning algorithms are vulnerable to catastrophic interference, where accuracy on older examples can decrease after the learner experiences numerous new examples in a different part of the state space. This behavior is referred to as forgetting. As in real life, forgetting is clearly inadvisable for any learning control algorithm.

Instance-based learners are nonparametric and thus avoid the problem of forgetting. Applying them directly to reinforcement learning, however, is not entirely trivial, as the reinforcement learning problem is not a supervised learning problem but a delayed reinforcement problem. Furthermore, instance-based techniques can use a great deal of memory and are sensitive to the magnitudes of the inputs. This paper presents techniques that avoid these difficulties by using a value-updating algorithm, instance averaging, and automatic dimension scaling.

Finally, reinforcement learning can require many runs in the environment to learn a successful policy. We mitigate this by using memory-based reinforcement learning, which learns a model of the environment and uses it to improve the value function without taking steps in the actual environment. Many methods have been suggested for how best to use a model to update the value function.
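As a concrete illustration of the instance-based representation sketched above, the value of a state-action pair can be estimated as a distance-weighted average over the k nearest stored instances, with per-dimension weights standing in for the automatic dimension scaling mentioned earlier. The sketch below is ours, not the paper's implementation; the names (InstanceStore, estimate_q) and the particular weighting scheme are illustrative assumptions.

```python
import numpy as np

class InstanceStore:
    """Illustrative instance-based Q-function approximator.

    Stores (state, action, q) tuples and answers queries with a
    distance-weighted average of the k nearest stored instances.
    The per-dimension scale weights stand in for automatic
    dimension scaling; all interfaces here are assumptions.
    """

    def __init__(self, k=5, scale=None):
        self.k = k
        self.scale = scale          # per-dimension weights on [state, action]
        self.instances = []         # list of (features, q) pairs

    def _features(self, state, action):
        return np.concatenate([np.atleast_1d(np.asarray(state, float)),
                               np.atleast_1d(np.asarray(action, float))])

    def add(self, state, action, q):
        self.instances.append((self._features(state, action), q))

    def estimate_q(self, state, action):
        """Distance-weighted average Q over the k nearest instances."""
        x = self._features(state, action)
        if not self.instances:
            return 0.0              # neutral default before any experience
        w = self.scale if self.scale is not None else np.ones_like(x)
        feats = np.stack([f for f, _ in self.instances])
        qs = np.array([q for _, q in self.instances])
        dists = np.sqrt(((w * (feats - x)) ** 2).sum(axis=1))
        idx = np.argsort(dists)[: self.k]
        weights = 1.0 / (dists[idx] + 1e-6)    # nearer instances count more
        return float(np.dot(weights, qs[idx]) / weights.sum())
```

Details of the value-updating, instance-averaging, and scaling mechanisms that make such a representation practical are given in Section 1.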
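To make the model-based updates concrete, the classical tabular form of prioritized sweeping keeps a priority queue of state-action pairs ordered by how much their values are expected to change, and performs backups through the learned model rather than through real environment steps. The sketch below illustrates only that generic idea under simplifying assumptions (a deterministic, tabular model with the interfaces named in the docstring); it is not the generalized algorithm used in this paper.

```python
import heapq
import itertools

def prioritized_sweeping(model, predecessors, Q, actions,
                         gamma=0.95, theta=1e-3, n_updates=50):
    """Tabular prioritized-sweeping sketch (illustrative assumptions only).

    model[(s, a)]   -> (reward, next_state), learned from experience
    predecessors[s] -> iterable of (s_prev, a_prev) pairs leading into s
    Q[(s, a)]       -> current value estimate (a mapping with default 0)
    """
    tie = itertools.count()                      # break priority ties safely
    pq = []

    def bellman_error(s, a):
        r, s2 = model[(s, a)]
        return abs(r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])

    # Seed the queue with transitions whose values look stale.
    for (s, a) in model:
        err = bellman_error(s, a)
        if err > theta:
            heapq.heappush(pq, (-err, next(tie), (s, a)))

    for _ in range(n_updates):
        if not pq:
            break
        _, _, (s, a) = heapq.heappop(pq)
        r, s2 = model[(s, a)]
        # Full backup through the learned model, no real environment step.
        Q[(s, a)] = r + gamma * max(Q[(s2, b)] for b in actions)
        # Values of predecessors of s may now be stale: requeue them.
        for (sp, ap) in predecessors.get(s, ()):
            err = bellman_error(sp, ap)
            if err > theta:
                heapq.heappush(pq, (-err, next(tie), (sp, ap)))
    return Q
```

Roughly speaking, generalized prioritized sweeping extends this prioritization idea beyond the tabular setting, which is what makes it applicable to the continuous domains considered here.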
We utilize a novel method for our representation based on generalized prioritized sweeping (Andre et al., 1997). We demonstrate our methods on two simulated driving tasks.

The structure of this paper is as follows. Section 1 describes the instance-based representation for value function approximation and how it can be used to learn control policies effectively from experience. This section also describes the extensions that were necessary for the representation to be practical for vehicle control. Section 2 shows how one can use a structured domain model to learn more efficiently and handle fundamental autonomous vehicle tasks. Section 3 presents some empirical results on learning to control a simulated vehicle to steer itself in the center of the lane. The paper ends with conclusions and acknowledgments.

1. Instance-based value reinforcement