An Alternative Softmax Operator for Reinforcement Learning

A softmax operator applied to a set of values acts somewhat like a maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one's weight behind a single maximum-utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that it is prone to misbehavior. In this work, we study a differentiable softmax operator that, among other properties, is a non-expansion, ensuring convergent behavior in learning and planning. We introduce a variant of the SARSA algorithm that, using the new operator, computes a Boltzmann policy with a state-dependent temperature parameter. We show that the algorithm is convergent and that it performs favorably in practice.
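
The abstract does not name the operator it studies, so the sketch below rests on an assumption: it contrasts the Boltzmann softmax with the log-average-exp form (often called mellowmax), a standard differentiable operator that is a non-expansion under the infinity norm. Both functions are illustrative, not the paper's implementation.

```python
import numpy as np

def boltzmann_softmax(x, beta):
    # Exp-weighted average of the values: approaches the mean as beta -> 0
    # and the max as beta -> inf. It is not a non-expansion in general,
    # which is the kind of misbehavior the abstract alludes to.
    x = np.asarray(x, dtype=float)
    w = np.exp(beta * (x - x.max()))  # shift by the max for numerical stability
    return float(np.dot(w / w.sum(), x))

def mellowmax(x, omega):
    # Log-average-exp: also interpolates between the mean (omega -> 0) and
    # the max (omega -> inf), but is a non-expansion for every omega > 0.
    x = np.asarray(x, dtype=float)
    m = x.max()
    return float(m + np.log(np.mean(np.exp(omega * (x - m)))) / omega)

# Non-expansion check: |op(x) - op(y)| <= max_i |x_i - y_i| holds for
# mellowmax on any pair of value vectors; Boltzmann softmax can violate it.
x, y = np.array([1.0, 2.0, 3.0]), np.array([1.5, 1.5, 3.2])
gap = np.abs(x - y).max()
assert abs(mellowmax(x, 5.0) - mellowmax(y, 5.0)) <= gap
```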

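One plausible reading of the SARSA variant's "Boltzmann policy with a state-dependent temperature parameter" is that, in each state, the temperature is chosen so that the Boltzmann average of the action values matches the new operator's value. The sketch below (reusing boltzmann_softmax and mellowmax from above) recovers such a temperature by root-finding; the matching rule and the bracket are assumptions, not the paper's stated algorithm.

```python
import numpy as np
from scipy.optimize import brentq

def state_dependent_beta(q, omega=5.0):
    # Find the beta whose Boltzmann-weighted average of the action values q
    # equals mellowmax(q, omega). This matching rule is an assumption; the
    # abstract only says the temperature is state-dependent.
    q = np.asarray(q, dtype=float)
    if np.ptp(q) < 1e-12:
        return 0.0  # all values equal: every temperature gives the same policy
    target = mellowmax(q, omega)
    # boltzmann_softmax(q, beta) increases monotonically in beta from mean(q)
    # toward max(q), and mean(q) < mellowmax(q) < max(q), so a root is
    # bracketed in [0, B] for a sufficiently large B.
    return brentq(lambda b: boltzmann_softmax(q, b) - target, 0.0, 1e3)

beta_s = state_dependent_beta(np.array([0.1, 0.4, 0.2]))  # one beta per state
```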