Approximate Policy Iteration Schemes: A Comparison

We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on several approximate variations of the Policy Iteration algorithm: Approximate Policy Iteration (API) (Bertsekas & Tsitsiklis, 1996), Conservative Policy Iteration (CPI) (Kakade & Langford, 2002), a natural adaptation of the Policy Search by Dynamic Programming algorithm (Bagnell et al., 2003) to the infinite-horizon case (PSDP∞), and the recently proposed Non-Stationary Policy Iteration (NSPI(m)) (Scherrer & Lesner, 2012). For all of these algorithms, we describe performance bounds with respect to the per-iteration error ε, and we compare them, paying particular attention to the concentrability constants involved, the number of iterations, and the memory required. Our analysis highlights the following points: 1) The performance guarantee of CPI can be arbitrarily better than that of API, but this comes at the cost of a relative increase of the number of iterations that is exponential in 1/ε. 2) PSDP∞ enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but it is obtained within a number of iterations similar to that of API. 3) Unlike API, which requires constant memory, the memory needed by CPI and PSDP∞ is proportional to their number of iterations, which may be problematic when the discount factor γ is close to 1 or the approximation error ε is close to 0; we show that the NSPI(m) algorithm allows an overall trade-off between memory and performance. Simulations with these schemes confirm our analysis.
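To make the contrast between API's greedy update and CPI's conservative update concrete, here is a minimal tabular sketch (not taken from the paper): API replaces the current policy by the greedy policy at every iteration, while CPI only mixes a small step toward it. The small random MDP, the fixed mixing coefficient `alpha`, and the iteration counts are illustrative assumptions; in CPI the step size would normally be chosen from an estimated advantage rather than fixed.

```python
import numpy as np

# Illustrative tabular MDP (random transitions and rewards), not the paper's setup.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.uniform(size=(n_states, n_actions))                       # r[s, a]

def evaluate(pi):
    """Exact evaluation of a stochastic policy pi[s, a]: solve (I - gamma P_pi) v = r_pi."""
    P_pi = np.einsum('sap,sa->sp', P, pi)
    r_pi = np.einsum('sa,sa->s', r, pi)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def greedy(v):
    """Deterministic greedy policy w.r.t. the Q-function induced by v."""
    q = r + gamma * P @ v                      # q[s, a]
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), q.argmax(axis=1)] = 1.0
    return pi

# API-style iteration: full replacement by the greedy policy.
pi_api = np.full((n_states, n_actions), 1.0 / n_actions)
for _ in range(20):
    pi_api = greedy(evaluate(pi_api))

# CPI-style iteration: conservative mixture (1 - alpha) * pi + alpha * greedy(pi).
alpha = 0.1
pi_cpi = np.full((n_states, n_actions), 1.0 / n_actions)
for _ in range(200):
    pi_cpi = (1 - alpha) * pi_cpi + alpha * greedy(evaluate(pi_cpi))

print(evaluate(pi_api).round(3), evaluate(pi_cpi).round(3))
```

With exact evaluation both schemes reach the optimal values here; the point of the sketch is only the shape of the two updates, which is what drives the trade-off between performance guarantees and number of iterations discussed above.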

[1] Michail G. Lagoudakis et al. Least-Squares Policy Iteration. J. Mach. Learn. Res., 2003.

[2] Alessandro Lazaric et al. Conservative and Greedy Approaches to Classification-Based Policy Iteration. AAAI, 2012.

[3] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.

[4] Alessandro Lazaric et al. Analysis of a Classification-based Policy Iteration Algorithm. ICML, 2010.

[5] Bruno Scherrer et al. Performance bounds for λ policy iteration and application to the game of Tetris. J. Mach. Learn. Res., 2013.

[6] Michail G. Lagoudakis et al. Reinforcement Learning as Classification: Leveraging Modern Classifiers. ICML, 2003.

[7] Rémi Munos et al. Performance Bounds in Lp-norm for Approximate Value Iteration. SIAM J. Control Optim., 2007.

[8] Bruno Scherrer et al. On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes. NIPS, 2012.

[9] K. I. M. McKinnon et al. On the Generation of Markov Decision Processes, 1995.

[10] Matthieu Geist et al. Approximate Modified Policy Iteration. ICML, 2012.

[11] Rémi Munos et al. Error Bounds for Approximate Policy Iteration. ICML, 2003.

[12] Bruno Scherrer et al. Performance Bounds for Lambda Policy Iteration and Application to the Game of Tetris, 2007.

[13] Jeff G. Schneider et al. Policy Search by Dynamic Programming. NIPS, 2003.

[14] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[15] U. Rieder et al. Markov Decision Processes, 2010.

[16] Sham M. Kakade and John Langford. Approximately Optimal Approximate Reinforcement Learning. ICML, 2002.

[17] Csaba Szepesvári et al. Error Propagation for Approximate Policy and Value Iteration. NIPS, 2010.