Fitted Q-iteration in continuous action-space MDPs

We consider batch reinforcement learning in continuous-state, continuous-action problems, where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration in which greedy action selection is replaced by a search over a restricted set of candidate policies for the one that maximizes the average action value. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function-based algorithms for continuous state and action problems.
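To make the idea concrete, the following is a minimal sketch of fitted Q-iteration with policy search, not the algorithm as analyzed in the paper. Everything specific in it is an assumption chosen for illustration: the one-dimensional toy dynamics and reward, the quadratic feature map used for the action-value function, the candidate policy set of clipped linear policies `a = clip(theta * x)`, and all function and parameter names. Each iteration regresses onto one-step targets evaluated at the current policy's action, then picks the candidate policy with the highest average fitted action value over the observed states.

```python
import numpy as np

def features(x, a):
    # Hypothetical quadratic feature map for Q(x, a); an illustrative choice.
    x, a = np.asarray(x, dtype=float), np.asarray(a, dtype=float)
    return np.stack([np.ones_like(x), x, a, x * x, x * a, a * a], axis=-1)

def policy_action(theta, x):
    # Restricted policy class (assumed): linear in the state, clipped to [-1, 1].
    return np.clip(theta * x, -1.0, 1.0)

def collect_trajectory(n, rng):
    # Single trajectory from a uniformly random behavior policy on a toy MDP:
    # x' = clip(x + 0.5 a + noise), reward = -(x')^2. All of this is assumed.
    xs = np.empty(n + 1)
    xs[0] = rng.uniform(-1.0, 1.0)
    acts = rng.uniform(-1.0, 1.0, size=n)
    rewards = np.empty(n)
    for t in range(n):
        noise = 0.05 * rng.standard_normal()
        xs[t + 1] = np.clip(xs[t] + 0.5 * acts[t] + noise, -1.0, 1.0)
        rewards[t] = -xs[t + 1] ** 2
    return xs[:-1], acts, rewards, xs[1:]

def fitted_q_iteration(x, a, r, x_next, thetas, gamma=0.9, n_iters=20):
    w = np.zeros(6)          # Q_0 = 0
    theta = thetas[0]        # arbitrary initial policy from the candidate set
    phi = features(x, a)
    for _ in range(n_iters):
        # Regression targets use the current policy's action at the next
        # state instead of a maximum over the continuous action space.
        q_next = features(x_next, policy_action(theta, x_next)) @ w
        targets = r + gamma * q_next
        # Least-squares fit of the next action-value function.
        w, *_ = np.linalg.lstsq(phi, targets, rcond=None)
        # Policy search: maximize the average fitted action value.
        scores = [np.mean(features(x, policy_action(th, x)) @ w) for th in thetas]
        theta = thetas[int(np.argmax(scores))]
    return w, theta

rng = np.random.default_rng(0)
batch = collect_trajectory(2000, rng)
thetas = np.linspace(-2.0, 2.0, 9)   # finite candidate set, for simplicity
w, theta = fitted_q_iteration(*batch, thetas)
print("selected policy gain:", theta)
```

A finite grid of candidate policies keeps the search step to a simple enumeration; with a richer parametric policy class, that step would become an optimization problem in its own right.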
