Leverage the Average: an Analysis of KL Regularization in Reinforcement Learning

Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler (KL) regularization as a core component have shown outstanding performance. Yet, so far, little is understood theoretically about why KL regularization helps. We study KL regularization within an approximate value iteration scheme and show that it implicitly averages q-values. Leveraging this insight, we provide a very strong performance bound, the very first to combine two desirable aspects: a linear dependency on the horizon (instead of quadratic) and an error propagation term involving an averaging effect of the estimation errors (instead of an accumulation effect). We also study the more general case of an additional entropy regularizer. The resulting abstract scheme encompasses many existing RL algorithms. Some of our assumptions do not hold with neural networks, so we complement this theoretical analysis with an extensive empirical study.
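To make the averaging effect concrete, here is a minimal sketch of a KL-regularized greedy step, written in our own notation (q_k for the iteration-k action-value estimate, λ for the regularization weight, π_0 a uniform initial policy); this is the standard mirror-descent computation, given as an illustration rather than the exact scheme analyzed in the paper:

\pi_{k+1} = \operatorname*{argmax}_{\pi} \; \langle \pi, q_k \rangle - \lambda \, \mathrm{KL}(\pi \,\|\, \pi_k)
\quad\Longrightarrow\quad
\pi_{k+1}(a \mid s) \propto \pi_k(a \mid s) \exp\!\big(q_k(s,a)/\lambda\big).

Unrolling this recursion from the uniform \pi_0 gives

\pi_{k+1}(a \mid s) \propto \exp\!\Big(\tfrac{1}{\lambda} \sum_{j=0}^{k} q_j(s,a)\Big)
= \exp\!\Big(\tfrac{k+1}{\lambda}\, \bar q_k(s,a)\Big),
\qquad \bar q_k = \tfrac{1}{k+1} \sum_{j=0}^{k} q_j,

so each policy is a softmax over the (scaled) average of all past q-estimates. This is the sense in which KL regularization implicitly averages q-values, and hence averages (rather than accumulates) estimation errors.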
