Model-based function approximation in reinforcement learning

Reinforcement learning promises a generic method for adapting agents to arbitrary tasks in arbitrary stochastic environments, but applying it to new real-world problems remains difficult, a few impressive success stories notwithstanding. Most interesting agent-environment systems have large state spaces, so performance depends crucially on efficient generalization from a small amount of experience. Current algorithms rely on model-free function approximation, which estimates the long-term values of states and actions directly from data and assumes that actions have similar values in similar states. This paper proposes model-based function approximation, which combines two forms of generalization by assuming that in addition to having similar values in similar states, actions also have similar effects. For one family of generalization schemes known as averagers, computation of an approximate value function from an approximate model is shown to be equivalent to the computation of the exact value function for a finite model derived from data. This derivation both integrates two independent sources of generalization and permits the extension of model-based techniques developed for finite problems. Preliminary experiments with a novel algorithm, AMBI (Approximate Models Based on Instances), demonstrate that this approach yields faster learning on some standard benchmark problems than many contemporary algorithms.
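To make the averager idea concrete, the sketch below shows one way the abstract's claim can be realized: transitions gathered from experience define a finite derived model, an averager (here a Gaussian kernel, purely an illustrative choice) supplies the similarity weights, and value iteration on the stored samples yields the approximate value function. The function names, the kernel, and all parameter values are assumptions for illustration, not the paper's exact AMBI algorithm.

```python
# Illustrative sketch only (not the paper's AMBI implementation):
# kernel-averager model approximation plus value iteration on the
# finite model derived from stored transitions.

import numpy as np

def gaussian_weights(query, samples, bandwidth=0.5):
    """Averager weights: non-negative and summing to one (kernel smoothing)."""
    d2 = np.sum((samples - query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return w / w.sum()

def solve_derived_mdp(transitions, gamma=0.95, n_iters=200, bandwidth=0.5):
    """transitions: list of (state, action, reward, next_state) tuples.
    States are real-valued feature vectors; the action set is finite and
    every action is assumed to appear at least once in the data."""
    states = np.array([s for s, _, _, _ in transitions])
    next_states = np.array([s2 for _, _, _, s2 in transitions])
    rewards = np.array([r for _, _, r, _ in transitions])
    actions = np.array([a for _, a, _, _ in transitions])
    action_set = sorted(set(actions.tolist()))

    # V is defined on the sampled next-states: this finite set of points
    # plays the role of the "finite model derived from data".
    V = np.zeros(len(transitions))
    for _ in range(n_iters):
        Q = np.empty((len(transitions), len(action_set)))
        for j, s2 in enumerate(next_states):
            for k, a in enumerate(action_set):
                mask = actions == a
                w = gaussian_weights(s2, states[mask], bandwidth)
                # Averager backup: similarity-weighted expected reward plus
                # discounted value of the corresponding stored successors.
                Q[j, k] = w @ (rewards[mask] + gamma * V[mask])
        V = Q.max(axis=1)
    return V
```

Because the averager weights are non-negative and sum to one, each backup is a convex combination of stored backups, which is what keeps this form of fitted value iteration stable and makes it equivalent to solving an exact finite MDP over the sampled points.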
