Exploration versus Exploitation in Reinforcement Learning: A Stochastic Control Approach

We consider reinforcement learning (RL) in continuous time and study the problem of achieving the best trade-off between exploration of a black-box environment and exploitation of current knowledge. We propose an entropy-regularized reward function involving the differential entropy of the distributions of actions, and motivate and devise an exploratory formulation for the state dynamics that captures repetitive learning under exploration. The resulting optimization problem revives the classical relaxed stochastic control formulation. We carry out a complete analysis of the problem in the linear--quadratic (LQ) setting and deduce that the optimal feedback control distribution for balancing exploitation and exploration is Gaussian. This, in turn, interprets and justifies the widely adopted Gaussian exploration in RL, beyond its simplicity for sampling. Moreover, exploitation and exploration are captured, respectively and mutually exclusively, by the mean and variance of the Gaussian distribution. We also find that a more random environment contains more learning opportunities, in the sense that less exploration is needed. We characterize the cost of exploration, which, in the LQ case, is shown to be proportional to the entropy regularization weight and inversely proportional to the discount rate. Finally, as the weight of exploration decays to zero, we prove that the solution of the entropy-regularized LQ problem converges to that of the classical LQ problem.
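To fix ideas, here is a minimal sketch of an entropy-regularized exploratory objective of the kind described above; the notation (state X_t, action distribution \pi_t, action space U, running reward r, temperature \lambda > 0, discount rate \rho > 0) is generic and illustrative, not necessarily the paper's exact formulation:

\[
V(x) \;=\; \sup_{\pi}\; \mathbb{E}\!\left[ \int_0^{\infty} e^{-\rho t} \left( \int_U r(X_t,u)\,\pi_t(u)\,du \;-\; \lambda \int_U \pi_t(u)\ln \pi_t(u)\,du \right) dt \;\middle|\; X_0 = x \right].
\]

The second inner integral is the negative differential entropy of the action distribution \pi_t, so the weight \lambda trades exploration (high entropy) against exploitation (high expected reward). In the LQ setting referred to in the abstract, the maximizing \pi_t is Gaussian, and letting \lambda \to 0 recovers the classical (non-exploratory) control problem.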
