Planning in entropy-regularized Markov decision processes and games

We propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment (the MDP or game). SmoothCruiser exploits the smoothness of the Bellman operator induced by the regularization to achieve a problem-independent sample complexity of order $\tilde{\mathcal{O}}(1/\epsilon^4)$ for a desired accuracy $\epsilon$, whereas in the non-regularized setting no algorithm is known to guarantee polynomial sample complexity in the worst case.
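For context, a common form of the entropy-regularized (soft) Bellman operator is the log-sum-exp smoothing of the usual maximum over actions; the notation below (regularization parameter $\lambda > 0$, discount $\gamma$, reward $r$, transition kernel $p$) is an illustrative sketch and is not taken from the paper itself:

$$[\mathcal{T}_\lambda V](s) \;=\; \lambda \log \sum_{a \in \mathcal{A}} \exp\!\left( \frac{r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\big[ V(s') \big]}{\lambda} \right).$$

As $\lambda \to 0$ this recovers the standard Bellman operator, while for $\lambda > 0$ the operator is smooth in the underlying Q-values, which is the kind of structure a planning algorithm can exploit.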
