Implicitly Regularized RL with Implicit Q-Values

The Q-function is a central quantity in many Reinforcement Learning (RL) algorithms, in which agents act according to a (soft)-greedy policy w.r.t. Q. It is a powerful tool that allows action selection without a model of the environment, and even without explicitly modeling the policy. Yet, this scheme can only be used in discrete-action tasks with a small number of actions, as the softmax cannot be computed exactly otherwise. In particular, the use of function approximation to handle continuous action spaces in modern actor-critic architectures intrinsically prevents the exact computation of a softmax. We propose to alleviate this issue by parametrizing the Q-function implicitly, as the sum of a log-policy and a value function. We use the resulting parametrization to derive a practical off-policy deep RL algorithm that is suitable for large action spaces and enforces the softmax relation between the policy and the Q-value. We provide a theoretical analysis of our algorithm: from an Approximate Dynamic Programming perspective, we show that it is equivalent to a regularized version of value iteration, accounting for both entropy and Kullback-Leibler regularization, and that it enjoys beneficial error-propagation results. We then evaluate our algorithm on classic control tasks, where its results compete with state-of-the-art methods.
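
As a minimal sketch of the parametrization described above (the temperature \tau and parameter set \theta are illustrative notation, not taken from the abstract), the implicit Q-function can be written as

Q_\theta(s, a) \;=\; \tau \log \pi_\theta(a \mid s) \;+\; V_\theta(s).

Because \pi_\theta(\cdot \mid s) is a normalized distribution, rearranging gives \pi_\theta(a \mid s) = \exp\big((Q_\theta(s, a) - V_\theta(s)) / \tau\big), so V_\theta(s) plays the role of a log-partition term and \pi_\theta(\cdot \mid s) = \operatorname{softmax}(Q_\theta(s, \cdot) / \tau) holds by construction, even when the softmax itself cannot be computed exactly over the action space.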
