Paper information - A short variational proof of equivalence between policy gradients and soft Q learning

Two main families of reinforcement learning algorithms, Q-learning and policy gradients, have recently been proven equivalent when a softmax relaxation is applied to the former and an entropic regularization to the latter. We relate this result to the well-known convex duality between Shannon entropy and the softmax function, a result also known as the Donsker-Varadhan formula. This yields a short proof of the equivalence. We then interpret the duality further and use ideas from convex analysis to prove a new policy inequality for soft Q-learning.
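The duality the abstract refers to can be checked numerically on a single state with a finite action set. The sketch below is an illustration only, not the paper's proof: the names `q` (a vector of action values), `tau` (a temperature) and the helper functions are assumptions introduced here. It verifies that the entropy-regularized objective sum_a p(a) q(a) + tau * H(p) is maximized by the softmax policy, where it equals tau * log sum_a exp(q(a) / tau).

```python
import math
import random


def log_sum_exp(q, tau=1.0):
    """Scaled log-sum-exp: tau * log(sum_a exp(q[a] / tau)), computed stably."""
    m = max(q)
    return m + tau * math.log(sum(math.exp((x - m) / tau) for x in q))


def softmax(q, tau=1.0):
    """Softmax (Boltzmann) distribution with temperature tau."""
    m = max(q)
    w = [math.exp((x - m) / tau) for x in q]
    z = sum(w)
    return [x / z for x in w]


def entropy(p):
    """Shannon entropy H(p) = -sum_a p[a] * log p[a], in nats."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def regularized_objective(p, q, tau=1.0):
    """Expected value of q under p plus tau times the entropy of p."""
    return sum(pi * qi for pi, qi in zip(p, q)) + tau * entropy(p)


def random_policy(n, rng):
    """A random point of the probability simplex (hypothetical test helper)."""
    w = [rng.random() + 1e-12 for _ in range(n)]
    z = sum(w)
    return [x / z for x in w]


if __name__ == "__main__":
    q, tau = [1.0, 2.0, 0.5], 0.7  # arbitrary illustrative values
    p_star = softmax(q, tau)
    print("soft value:          ", log_sum_exp(q, tau))
    print("objective at softmax:", regularized_objective(p_star, q, tau))
    rng = random.Random(0)
    p = random_policy(len(q), rng)
    print("objective at random p:", regularized_objective(p, q, tau))
```

For any other policy `p`, the objective falls short of the soft value by tau times the Kullback-Leibler divergence from `p` to the softmax policy, which is why the softmax policy is the unique maximizer.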

Brendan Maginnis | Pierre H. Richemond
