The Role of Lookahead and Approximate Policy Evaluation in Reinforcement Learning with Linear Value Function Approximation

When the sizes of the state and action spaces are large, solving MDPs can be computationally prohibitive even if the probability transition matrix is known. In practice, a number of techniques are therefore used to approximately solve the dynamic programming problem, including lookahead, approximate policy evaluation using an m-step return, and function approximation. In a recent paper, Efroni et al. (2019) studied the impact of lookahead on the convergence rate of approximate dynamic programming. In this paper, we show that these convergence results change dramatically when function approximation is used in conjunction with lookahead and approximate policy evaluation using an m-step return. Specifically, we show that when linear function approximation is used to represent the value function, a certain minimum amount of lookahead and multi-step return is needed for the algorithm to even converge. When this condition is met, we characterize the performance of policies obtained using such approximate policy iteration. Our results are presented for two different procedures to compute the function approximation: linear least-squares regression and gradient descent.
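
To make the setting concrete, the following Python sketch runs one common form of approximate policy iteration with H-step lookahead for policy improvement, an m-step return for policy evaluation, and a linear least-squares fit of the value function, on a small random MDP. The MDP, features, and all parameter values (S, A, d, H, m, gamma) are illustrative assumptions; this is a sketch of the moving parts, not the exact construction or convergence condition analyzed in the paper.

import numpy as np

# Minimal sketch (not the paper's exact scheme): approximate policy iteration
# with H-step lookahead improvement, m-step-return evaluation, and a linear
# least-squares fit of the value function. All sizes and constants are
# illustrative assumptions.
rng = np.random.default_rng(0)
S, A, d = 20, 4, 5        # states, actions, feature dimension (hypothetical)
H, m, gamma = 3, 5, 0.9   # lookahead depth, return length, discount (hypothetical)

P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)  # P[a] row-stochastic
r = rng.random((A, S))                                        # r[a, s] rewards
Phi = rng.standard_normal((S, d))                             # linear features

def bellman_opt(J):
    # One application of the Bellman optimality operator T.
    return np.max(r + gamma * np.einsum('aij,j->ai', P, J), axis=0)

def lookahead_policy(J, H):
    # Greedy policy with respect to T^{H-1} J, i.e. H-step lookahead.
    for _ in range(H - 1):
        J = bellman_opt(J)
    return np.argmax(r + gamma * np.einsum('aij,j->ai', P, J), axis=0)

theta = np.zeros(d)
for k in range(30):
    J = Phi @ theta
    mu = lookahead_policy(J, H)                        # policy improvement
    P_mu, r_mu = P[mu, np.arange(S), :], r[mu, np.arange(S)]
    ret = J.copy()
    for _ in range(m):                                 # m-step return under mu,
        ret = r_mu + gamma * P_mu @ ret                # bootstrapped from Phi @ theta
    theta = np.linalg.lstsq(Phi, ret, rcond=None)[0]   # least-squares (projected) evaluation

print("approximate values for the first few states:", np.round(Phi @ theta, 3)[:5])

Whether an iteration of this kind converges at all is, per the abstract, governed by how much lookahead and how long a return are used relative to the linear function approximation; with small H and m, a loop like the one above can diverge, which is precisely the phenomenon studied here.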

[1]  Dimitri Bertsekas Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control , 2021, ArXiv.

[2]  Csaba Szepesvári,et al.  Bandit Based Monte-Carlo Planning , 2006, ECML.

[3]  Catholijn M. Jonker,et al.  A Framework for Reinforcement Learning and Planning , 2020, ArXiv.

[4]  Arthur Jacot,et al.  Neural Tangent Kernel: Convergence and Generalization in Neural Networks , 2018, NeurIPS.

[5]  Jackie Kay,et al.  Local Search for Policy Iteration in Continuous Control , 2020, ArXiv.

[6]  D. Bertsekas Approximate policy iteration: a survey and some new methods , 2011 .

[7]  Shie Mannor,et al.  Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning , 2018, NeurIPS.

[8]  D. Bertsekas  Reinforcement Learning and Optimal Control: A Selective Overview , 2018 .

[9]  Andrew Tridgell,et al.  TDLeaf(lambda): Combining Temporal Difference Learning with Game-Tree Search , 1999, ArXiv.

[10]  Shie Mannor,et al.  Beyond the One Step Greedy Approach in Reinforcement Learning , 2018, ICML.

[11]  Devavrat Shah,et al.  Non-Asymptotic Analysis of Monte Carlo Tree Search , 2019 .

[12]  Demis Hassabis,et al.  Mastering the game of Go without human knowledge , 2017, Nature.

[13]  M. Puterman,et al.  Modified Policy Iteration Algorithms for Discounted Markov Decision Problems , 1978 .

[14]  Demis Hassabis,et al.  Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm , 2017, ArXiv.

[15]  Yuan Cao,et al.  Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks , 2019, NeurIPS.

[16]  Shie Mannor,et al.  Online Planning with Lookahead Policies , 2020, NeurIPS.

[17]  Joel Veness,et al.  Bootstrapping from Game Tree Search , 2009, NIPS.

[18]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[19]  Bruno Scherrer,et al.  Non-Stationary Approximate Modified Policy Iteration , 2015, ICML.

[20]  Simon M. Lucas,et al.  A Survey of Monte Carlo Tree Search Methods , 2012, IEEE Transactions on Computational Intelligence and AI in Games.

[21]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[22]  John N. Tsitsiklis,et al.  Feature-based methods for large scale dynamic programming , 2004, Machine Learning.

[23]  Shiqun Yin,et al.  Value-based Algorithms Optimization with Discounted Multiple-step Learning Method in Deep Reinforcement Learning , 2020, 2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS).

[24]  Shie Mannor,et al.  How to Combine Tree-Search Methods in Reinforcement Learning , 2018, AAAI.

[25]  Mohammad Ghavamzadeh,et al.  Multi-step Greedy Reinforcement Learning Algorithms , 2020, ICML.

[26]  Nathan R. Sturtevant,et al.  Monte Carlo Tree Search with heuristic evaluations using implicit minimax backups , 2014, 2014 IEEE Conference on Computational Intelligence and Games.

[27]  John N. Tsitsiklis,et al.  On the Convergence of Optimistic Policy Iteration , 2002, J. Mach. Learn. Res.

[28]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.