We propose a new approach to the problem of searching a space of stochastic controllers for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP). Following several other authors, our approach is based on searching in parameterized families of policies (for example, via gradient descent) to optimize solution quality. However, rather than trying to estimate the values and derivatives of a policy directly, we do so indirectly, using estimates of the probability densities that the policy induces on states at different points in time. This enables our algorithms to exploit the many techniques for efficient and robust approximate density propagation in stochastic systems. We show how our techniques can be applied both to deterministic propagation schemes (where the MDP's dynamics are given explicitly in compact form) and to stochastic propagation schemes (where we have access only to a generative model, or simulator, of the MDP). We present empirical results for both variants on complex problems.
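As an illustrative sketch only (not the paper's implementation), the deterministic-propagation idea can be made concrete on a small tabular MDP with a known transition model: the policy-averaged transition matrix pushes the state density forward in time, the expected return is computed from those densities rather than from sampled trajectories, and automatic differentiation supplies the policy gradient. All names and constants below (n_states, n_actions, horizon, transition, reward) are assumptions for the example; JAX is used for the gradient step.

```python
# Hypothetical sketch: policy search via exact density propagation in a
# small tabular MDP. Not the authors' code; all sizes and models are
# illustrative assumptions.
import jax
import jax.numpy as jnp

n_states, n_actions, horizon = 5, 2, 20

key = jax.random.PRNGKey(0)
# Known dynamics: transition[a, s, t] = P(next state t | state s, action a).
transition = jax.random.uniform(key, (n_actions, n_states, n_states))
transition = transition / transition.sum(axis=2, keepdims=True)
reward = jnp.linspace(0.0, 1.0, n_states)  # state-dependent reward

def policy(theta):
    # Stochastic policy pi(a | s) parameterized by logits theta[s, a].
    return jax.nn.softmax(theta, axis=1)

def expected_return(theta):
    pi = policy(theta)
    # Policy-averaged transition matrix: P[s, t] = sum_a pi(a|s) T[a, s, t].
    P = jnp.einsum('sa,ast->st', pi, transition)
    mu = jnp.ones(n_states) / n_states     # initial state density
    total = 0.0
    for _ in range(horizon):
        total = total + mu @ reward        # expected reward under density mu
        mu = mu @ P                        # propagate the density one step
    return total

# Gradient ascent on the return, with the gradient taken through the
# propagated densities rather than through sampled trajectories.
grad_fn = jax.grad(expected_return)
theta = jnp.zeros((n_states, n_actions))
for _ in range(200):
    theta = theta + 0.5 * grad_fn(theta)
```

In this toy setting the density propagation is exact; for large or factored state spaces, the matrix-vector product would be replaced by an approximate propagation scheme, and in the stochastic variant described above, the densities would instead be estimated from samples drawn from a generative model.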