Non-Stationary Approximate Modified Policy Iteration

We consider the infinite-horizon γ-discounted optimal control problem formalized by Markov Decision Processes. Running any instance of Modified Policy Iteration (a family of algorithms that can interpolate between Value and Policy Iteration) with an error ε at each iteration is known to lead to stationary policies that are at least 2γε/(1-γ)^2-optimal. Variations of Value and Policy Iteration that build l-periodic non-stationary policies have recently been shown to enjoy a better 2γε/((1-γ)(1-γ^l))-optimality guarantee. We describe a new algorithmic scheme, Non-Stationary Modified Policy Iteration, a family of algorithms parameterized by two integers m ≥ 0 and l ≥ 1 that generalizes all the above-mentioned algorithms. While m allows one to interpolate between Value-Iteration-style and Policy-Iteration-style updates, l specifies the period of the non-stationary policy that is output. We show that this new family of algorithms also enjoys the improved 2γε/((1-γ)(1-γ^l))-optimality guarantee. Perhaps more importantly, we show, by exhibiting an original problem instance, that this guarantee is tight for all m and l; this tightness was, to our knowledge, previously known only in two specific cases, Value Iteration (m = 0, l = 1) and Policy Iteration (m = ∞, l = 1).
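The abstract only names the scheme, so the sketch below is a minimal tabular illustration of the objects it refers to: an MPI-style loop in which a greedy step is followed by m+1 applications of the Bellman operator T_π (m = 0 behaves like Value Iteration, large m like Policy Iteration), with an additive perturbation standing in for the per-iteration error ε, and the evaluation of the l-periodic non-stationary policy that cycles through l given policies. The MDP arrays P and r, the Gaussian noise model, and all function names are assumptions made for illustration; this is not the exact NSMPI update analyzed in the paper.

```python
import numpy as np

# Minimal tabular sketch (hypothetical names, toy noise model).
# P has shape (S, A, S) with P[s, a, s'] = transition probability;
# r has shape (S, A) with r[s, a] = expected reward.

def bellman_op(P, r, gamma, pi, v):
    """Apply T_pi to v: (T_pi v)(s) = r(s, pi(s)) + gamma * E_{s'|s, pi(s)}[v(s')]."""
    idx = np.arange(r.shape[0])
    return r[idx, pi] + gamma * P[idx, pi] @ v

def greedy(P, r, gamma, v):
    """Greedy policy w.r.t. v (argmax over actions of the one-step backup)."""
    q = r + gamma * P @ v          # shape (S, A)
    return np.argmax(q, axis=1)

def mpi_like_loop(P, r, gamma, m, n_iter, eps=0.0, rng=None):
    """Greedy step followed by m+1 applications of T_pi; m = 0 behaves like
    Value Iteration, large m like Policy Iteration.  A Gaussian perturbation of
    size eps stands in for the approximation error made at each iteration."""
    if rng is None:
        rng = np.random.default_rng(0)
    v = np.zeros(r.shape[0])
    policies = []
    for _ in range(n_iter):
        pi = greedy(P, r, gamma, v)
        for _ in range(m + 1):
            v = bellman_op(P, r, gamma, pi, v)
        v = v + eps * rng.standard_normal(v.shape)   # per-iteration error
        policies.append(pi)
    return policies

def periodic_policy_value(P, r, gamma, cycle, tol=1e-10):
    """Value of the l-periodic non-stationary policy that cycles through
    `cycle` = (pi_1, ..., pi_l): the fixed point of T_{pi_1} ... T_{pi_l},
    a gamma^l-contraction, obtained by iterating the composed operator."""
    v = np.zeros(r.shape[0])
    while True:
        v_new = v
        for pi in reversed(cycle):                   # apply T_{pi_l} first
            v_new = bellman_op(P, r, gamma, pi, v_new)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```

A toy usage would build a random MDP, run mpi_like_loop, and pass its last l policies (ordered as the paper's non-stationary policy prescribes) to periodic_policy_value; the guarantees quoted above concern how far that value can be from optimal, with the loss bound improving from 2γε/(1-γ)^2 for l = 1 to 2γε/((1-γ)(1-γ^l)) for larger l.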
