Lambda-Policy Iteration: A Review and a New Implementation

In this paper we discuss λ-policy iteration, a method for exact and approximate dynamic programming. It is intermediate between the classical value iteration (VI) and policy iteration (PI) methods, and it is closely related to optimistic (also known as modified) PI, whereby each policy evaluation is done approximately, using a finite number of VI iterations. We review the theory of the method and associated questions of bias and exploration arising in simulation-based cost function approximation. We then discuss various implementations, which offer advantages over well-established PI methods that use LSPE(λ), LSTD(λ), or TD(λ) for policy evaluation with cost function approximation. One of these implementations is based on a new simulation scheme, called geometric sampling, which uses multiple short trajectories rather than a single infinitely long trajectory.
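As a point of reference, a minimal sketch of λ-policy iteration, following the standard formulation from the literature rather than reproducing the paper's own notation, can be written with the Bellman operators T and T_\mu. Given the current cost estimate J_k, select a greedy policy \mu_k with T_{\mu_k} J_k = T J_k, and set

J_{k+1} = T_{\mu_k}^{(\lambda)} J_k, \qquad T_\mu^{(\lambda)} = (1-\lambda) \sum_{\ell = 0}^{\infty} \lambda^{\ell} T_\mu^{\ell+1}, \qquad \lambda \in [0, 1).

For λ = 0 this reduces to value iteration (J_{k+1} = T J_k), while as λ → 1 the policy evaluation becomes exact and the method approaches policy iteration; intermediate values of λ interpolate between the two, which is the sense in which λ-PI resembles optimistic PI with a finite number of VI sweeps per policy.

Similarly, a minimal illustrative sketch of the geometric sampling idea, assuming (as the name suggests, not as a verbatim restatement of the paper's scheme) that trajectory lengths are drawn from a geometric distribution governed by λ, is:

    import numpy as np

    def sample_trajectory_lengths(lam, num_trajectories, rng=None):
        # Illustrative assumption: after each transition a trajectory is
        # continued with probability lam, so lengths are geometrically
        # distributed with mean 1 / (1 - lam).
        rng = np.random.default_rng() if rng is None else rng
        # numpy's geometric(p) counts trials up to and including the first
        # "success"; p = 1 - lam encodes the continue-with-probability-lam rule.
        return rng.geometric(1.0 - lam, size=num_trajectories)

    # Many short trajectories instead of one infinitely long one:
    lengths = sample_trajectory_lengths(lam=0.9, num_trajectories=1000)
    # E[length] = 1 / (1 - 0.9) = 10, so these cover roughly 10,000 transitions.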
