Understanding the impact of entropy on policy optimization

Entropy regularization is commonly used to improve policy optimization in reinforcement learning. It is believed to help with \emph{exploration} by encouraging the selection of more stochastic policies. In this work, we analyze this claim using new visualizations of the optimization landscape based on randomly perturbing the loss function. We first show that even with access to the exact gradient, policy optimization is difficult due to the geometry of the objective function. Then, we qualitatively show that in some environments, a policy with higher entropy can make the optimization landscape smoother, thereby connecting local optima and enabling the use of larger learning rates. This paper presents new tools for understanding the optimization landscape, shows that policy entropy serves as a regularizer, and highlights the challenge of designing general-purpose policy optimization algorithms.
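
The landscape visualizations described above rest on a simple primitive: evaluate the objective at parameter points obtained by perturbing the current parameters along random directions, and compare how the resulting curves behave with and without an entropy bonus. The sketch below is a minimal, self-contained illustration of that primitive; the toy one-step Gaussian-policy objective, the temperature `tau`, and the helper `perturbation_curves` are stand-ins chosen for the example, not the paper's actual environments or code.

```python
import numpy as np

def perturbation_curves(objective, theta, n_directions=8, radius=1.0, n_points=25, seed=0):
    """Probe the landscape around `theta` by evaluating `objective` along
    random unit directions: objective(theta + alpha * d) for alpha in
    [-radius, radius]. Returns the alpha grid and one curve per direction."""
    rng = np.random.default_rng(seed)
    alphas = np.linspace(-radius, radius, n_points)
    curves = np.empty((n_directions, n_points))
    for i in range(n_directions):
        d = rng.standard_normal(theta.shape)
        d /= np.linalg.norm(d)                          # unit-norm random direction
        curves[i] = [objective(theta + a * d) for a in alphas]
    return alphas, curves

if __name__ == "__main__":
    # Toy stand-in objective (hypothetical, not the paper's MuJoCo setup):
    # expected reward of a one-step Gaussian policy with mean theta[0] and
    # log-std theta[1], plus an entropy bonus weighted by tau.
    def objective(theta, tau=0.1, n_samples=2000, seed=1):
        rng = np.random.default_rng(seed)
        mean, log_std = theta
        actions = mean + np.exp(log_std) * rng.standard_normal(n_samples)
        rewards = -(actions - 2.0) ** 2                 # reward peaks at action = 2
        entropy = 0.5 * np.log(2 * np.pi * np.e) + log_std  # Gaussian entropy
        return rewards.mean() + tau * entropy           # entropy-regularized objective

    theta = np.array([0.0, -1.0])
    alphas, curves = perturbation_curves(objective, theta)
    print("objective at theta:", objective(theta))
    print("mean objective at radius 1:", curves[:, -1].mean())
```

Plotting the per-direction curves (or their mean and spread) for different values of `tau` gives a rough picture of how an entropy bonus changes the local geometry, in the spirit of the perturbation-based visualizations the abstract refers to.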
