Approximate modified policy iteration and its application to the game of Tetris

Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy iteration and value iteration methods as special cases. Despite its generality, MPI has not been thoroughly studied, especially its approximate form, which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that extend the well-known approximate DP algorithms fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide an error-propagation analysis that unifies those of approximate policy and value iteration, and we develop finite-sample analyses of these algorithms, which highlight the influence of their parameters. For the classification-based version of the algorithm (CBMPI), the analysis shows that MPI's main parameter controls the balance between the estimation error of the classifier and the overall quality of the value function approximation. We illustrate and evaluate the behavior of these new algorithms on the Mountain Car and Tetris problems. Remarkably, in Tetris, CBMPI outperforms existing DP approaches by a large margin and competes with the current state-of-the-art methods while using fewer samples.
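
MPI's scheme is simple to state: at each iteration it takes the greedy policy with respect to the current value estimate and then applies that policy's Bellman operator m times, so that m = 1 recovers value iteration and letting m grow large recovers policy iteration. Below is a minimal tabular sketch of this exact scheme, intended only to make the role of m concrete; the MDP array layout, discount factor, and parameter names are illustrative assumptions, not the paper's experimental setup. The paper's AMPI and CBMPI variants replace the exact greedy and partial-evaluation steps with regression or classification from samples.

```python
# Minimal tabular sketch of exact modified policy iteration (MPI).
# The MDP representation (P, r), gamma, m, and n_iter are illustrative
# assumptions used only to show the structure of the algorithm.
import numpy as np

def modified_policy_iteration(P, r, gamma=0.95, m=5, n_iter=100):
    """P: (A, S, S) transition kernels, r: (S, A) expected rewards.
    m = 1 reduces to value iteration; large m approaches policy iteration."""
    S, A = r.shape
    v = np.zeros(S)
    for _ in range(n_iter):
        # Greedy step: pi_{k+1}(s) = argmax_a [ r(s,a) + gamma * sum_s' P(s'|s,a) v_k(s') ]
        q = r + gamma * np.einsum('ast,t->sa', P, v)
        pi = q.argmax(axis=1)
        # Partial evaluation step: v_{k+1} = (T_{pi_{k+1}})^m v_k,
        # i.e. m applications of the Bellman operator of the greedy policy.
        r_pi = r[np.arange(S), pi]          # r(s, pi(s))
        P_pi = P[pi, np.arange(S)]          # P(.|s, pi(s)), shape (S, S)
        for _ in range(m):
            v = r_pi + gamma * P_pi @ v
    return v, pi
```

In this sketch the inner loop is the only place m appears: with m = 1 each iteration performs a single Bellman backup (value iteration), while a large m effectively evaluates the greedy policy to convergence (policy iteration), which is the balance that the paper's analysis of CBMPI makes precise.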
