Planning in entropy-regularized Markov decision processes and games

We propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment (the MDP or game). SmoothCruiser exploits the smoothness of the Bellman operator induced by the regularization to achieve a problem-independent sample complexity of order $\tilde{\mathcal{O}}(1/\epsilon^4)$ for a desired accuracy $\epsilon$, whereas in the non-regularized setting no algorithm is known to guarantee polynomial sample complexity in the worst case.
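For context, a common form of the entropy-regularized (soft) Bellman operator is the log-sum-exp smoothing of the usual maximum over actions; the notation below (regularization parameter $\lambda > 0$, discount $\gamma$, reward $r$, transition kernel $p$) is an illustrative sketch and is not taken from the paper itself:

$$[\mathcal{T}_\lambda V](s) \;=\; \lambda \log \sum_{a \in \mathcal{A}} \exp\!\left( \frac{r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\big[ V(s') \big]}{\lambda} \right).$$

As $\lambda \to 0$ this recovers the standard Bellman operator, while for $\lambda > 0$ the operator is smooth in the underlying Q-values, which is the kind of structure a planning algorithm can exploit.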
