Approximate Modified Policy Iteration

Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximate form, which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide an error propagation analysis that unifies those for approximate policy and value iteration. For the classification-based implementation, we develop a finite-sample analysis showing that MPI's main parameter controls the balance between the estimation error of the classifier and the overall value function approximation error.
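
For reference, a minimal sketch of the exact MPI recursion in its standard formulation (the symbols below, namely the greedy operator $\mathcal{G}$, the Bellman operator $T_\pi$, and the evaluation parameter $m$, are not defined in this abstract and follow the usual DP notation):
\begin{align*}
  \pi_{k+1} &\in \mathcal{G}(v_k) && \text{(greedy step: choose a policy greedy with respect to } v_k\text{)}\\
  v_{k+1}   &= (T_{\pi_{k+1}})^m v_k && \text{(partial evaluation: apply the Bellman operator of } \pi_{k+1} \text{ } m \text{ times)}
\end{align*}
Setting $m = 1$ recovers value iteration, while $m = \infty$ recovers policy iteration; $m$ is the "main parameter" whose role the finite-sample analysis quantifies.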