The Essential Dynamics Algorithm: Fast Policy Search In Continuous Worlds

This paper presents a novel algorithm for learning in a class of stochastic Markov decision processes (MDPs) with continuous state and action spaces, one that trades speed for accuracy. The algorithm can be seen as a generalization of linear quadratic control to nonlinear, non-regulation problems. The stochastic MDP is transformed into a deterministic one that captures the essence of the original dynamics, in a sense made precise, and in this transformed MDP the calculation of values is greatly simplified. The online algorithm estimates a model of the transformed MDP and simultaneously performs policy search against it. Bounds on the error of this approximation are proven, and experimental results are presented in both a bicycle riding domain and the control of a robot arm on a dynamic base, a 14-dimensional state space. The algorithm learns near-optimal policies in orders of magnitude fewer interactions with the stochastic MDP, using less domain knowledge. Code is available on the project’s web site.
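To make the general recipe concrete, the Python sketch below is a minimal illustration, not the paper's actual Essential Dynamics construction: a toy one-dimensional stochastic MDP is approximated by a deterministic linear model fit online to observed transitions, and a linear policy is improved by simple random-perturbation search against cheap, noise-free rollouts in that learned model. The 1-D domain, the linear model class, the quadratic reward, and the hill-climbing search are all assumptions made purely for this example.

import numpy as np

# Hypothetical 1-D stochastic MDP: a noisy point mass pushed toward the origin.
# State x, action a in [-1, 1]; true dynamics x' = x + 0.1*a + noise.
def true_step(x, a, rng):
    return x + 0.1 * a + rng.normal(0.0, 0.05)

def reward(x):
    return -x ** 2  # illustrative regulation-style reward

# Stand-in for the deterministic "essential dynamics" model: a linear map
# x' ~ w0 + w1*x + w2*a fit online (regularized least squares) to observed
# transitions, so it tracks the mean behaviour of the stochastic MDP.
class DeterministicModel:
    def __init__(self):
        self.A = np.eye(3) * 1e-3
        self.b = np.zeros(3)
        self.w = np.zeros(3)

    def update(self, x, a, x_next):
        phi = np.array([1.0, x, a])
        self.A += np.outer(phi, phi)
        self.b += phi * x_next
        self.w = np.linalg.solve(self.A, self.b)

    def step(self, x, a):
        return self.w @ np.array([1.0, x, a])

# Linear policy a = clip(theta0 + theta1*x), evaluated entirely inside the
# learned deterministic model, so each rollout is cheap and noise-free.
def rollout_value(model, theta, x0, horizon=30):
    x, total = x0, 0.0
    for _ in range(horizon):
        a = np.clip(theta[0] + theta[1] * x, -1.0, 1.0)
        x = model.step(x, a)
        total += reward(x)
    return total

# Policy search by random perturbation of the parameters against model rollouts.
def policy_search(model, theta, x0, rng, iters=20, sigma=0.1):
    best = rollout_value(model, theta, x0)
    for _ in range(iters):
        cand = theta + rng.normal(0.0, sigma, size=theta.shape)
        val = rollout_value(model, cand, x0)
        if val > best:
            theta, best = cand, val
    return theta

rng = np.random.default_rng(0)
model, theta, x = DeterministicModel(), np.zeros(2), 1.0
for step in range(200):
    a = np.clip(theta[0] + theta[1] * x, -1.0, 1.0)
    x_next = true_step(x, a, rng)                 # one interaction with the real MDP
    model.update(x, a, x_next)                    # refine the deterministic model
    theta = policy_search(model, theta, x, rng)   # cheap search in the model
    x = x_next
print("learned policy parameters:", theta)

The point of the sketch is the division of labour: the real stochastic MDP is queried only once per control step, while all policy evaluation happens in the deterministic surrogate, which is what makes the search fast.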
