Regularized Fitted Q-iteration: Application to Planning

We consider planning in a Markovian decision problem, i.e., the problem of finding a good policy given access to a generative model of the environment. We propose to use fitted Q-iteration with penalized (or regularized) least-squares regression as the regression subroutine in order to control model complexity. The algorithm is presented in detail for the case when the function space is the reproducing kernel Hilbert space induced by a user-chosen kernel function. We derive bounds on the quality of the solution and argue that data-dependent penalties can lead to almost optimal performance. A simple example illustrates the benefits of the penalized procedure.
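The abstract describes the method only at a high level. Below is a minimal, illustrative sketch of what fitted Q-iteration with a penalized least-squares (kernel ridge) regression step might look like for a finite action set and a Gaussian kernel; the function names, the kernel choice, and the regularization details are our assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    # Gaussian kernel matrix between the rows of A and B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def regularized_fitted_q_iteration(S, A, R, S_next, actions,
                                   gamma=0.99, lam=1e-2, n_iters=50):
    # S: (n, d_s) states, A: (n, 1) actions, R: (n,) rewards,
    # S_next: (n, d_s) next states drawn from the generative model;
    # `actions` is the finite set of available actions.
    n = S.shape[0]
    X = np.hstack([S, A])            # regression inputs: (state, action) pairs
    K = rbf_kernel(X, X)
    alpha = np.zeros(n)              # dual coefficients; initial Q is 0

    def q(states, a):
        # Evaluate the current Q estimate at (states, a) via the
        # representer-theorem expansion Q(x) = sum_i alpha_i k(x, x_i).
        Xq = np.hstack([states, np.full((states.shape[0], 1), float(a))])
        return rbf_kernel(Xq, X) @ alpha

    for _ in range(n_iters):
        # Bellman backup targets: y_i = r_i + gamma * max_a Q(s'_i, a).
        next_q = np.column_stack([q(S_next, a) for a in actions])
        y = R + gamma * next_q.max(axis=1)
        # Penalized least squares in the RKHS (kernel ridge regression):
        # minimize ||K alpha - y||^2 + lam * n * alpha^T K alpha,
        # whose minimizer solves (K + lam * n * I) alpha = y.
        alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return alpha, X
```

In practice the penalty parameter `lam` (and the kernel bandwidth) would be chosen in a data-dependent way, e.g., by cross-validation on held-out transitions; this is the kind of data-dependent penalty selection the abstract argues can lead to almost optimal performance.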
