Stochastic Convergence Results for Regularized Actor-Critic Methods

In this paper, we present a stochastic convergence proof, under suitable conditions, for a class of actor-critic algorithms that find approximate solutions to entropy-regularized MDPs, using the machinery of stochastic approximation. To obtain this result, we establish three results of independent practical and theoretical interest: we prove the convergence of policy evaluation with general regularizers under linear function approximation, we derive an entropy-regularized policy gradient theorem, and we prove the convergence of entropy-regularized policy improvement. We also provide a simple, illustrative empirical study corroborating our theoretical results. To the best of our knowledge, these are the first such results for approximate solution methods for regularized MDPs.
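For concreteness, the entropy-regularized objective underlying this class of algorithms can be sketched as follows (a sketch in standard notation; the temperature $\tau$ and discount $\gamma$ are not fixed by the abstract, so this is illustrative rather than the paper's exact formulation):

\[
V_\tau^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\,\bigl(r(s_t,a_t) + \tau\,\mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr)\bigr)\;\middle|\; s_0 = s\right],
\qquad
\mathcal{H}\bigl(\pi(\cdot \mid s)\bigr) \;=\; -\sum_{a}\pi(a \mid s)\log \pi(a \mid s).
\]

An entropy-regularized policy gradient theorem for a parameterized policy $\pi_\theta$ then typically takes the form

\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\!\left[\,\nabla_\theta \log \pi_\theta(a \mid s)\,\bigl(Q_\tau^{\pi_\theta}(s,a) - \tau \log \pi_\theta(a \mid s)\bigr)\right]
\]

(up to a state-dependent baseline), where $Q_\tau^{\pi_\theta}$ is the soft action-value function; the extra $-\tau \log \pi_\theta$ term is what distinguishes it from the unregularized policy gradient theorem.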
