A New Softmax Operator for Reinforcement Learning

A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one’s weight behind a single maximum utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In this work, we study an alternative softmax operator that, among other properties, is both a non-expansion (ensuring convergent behavior in learning and planning) and differentiable (making it possible to improve decisions via gradient descent methods). We provide proofs of these properties and present empirical comparisons between various softmax operators.
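The abstract contrasts the Boltzmann softmax with an alternative operator but does not spell out either formula. As a minimal illustrative sketch (not the paper's code), the snippet below implements the Boltzmann softmax as a Boltzmann-weighted average of the values, together with a log-average-exp operator as one example of a differentiable alternative; treating that particular form as the operator studied in the paper is an assumption, not something stated in the abstract.

```python
# Sketch of two softmax-style operators over a set of values.
# Assumption: the "alternative" operator is written here as a log-average-exp;
# the abstract itself does not name or define it.
import numpy as np

def boltzmann(x, beta):
    """Boltzmann softmax: sum_i x_i * exp(beta * x_i) / sum_j exp(beta * x_j)."""
    w = np.exp(beta * (x - np.max(x)))   # subtract max for numerical stability
    return float(np.dot(x, w) / np.sum(w))

def log_avg_exp(x, omega):
    """Log-average-exp operator: (1/omega) * log(mean(exp(omega * x)))."""
    m = np.max(x)                         # shift by max for numerical stability
    return float(m + np.log(np.mean(np.exp(omega * (x - m)))) / omega)

vals = np.array([0.1, 0.5, 0.9])
for t in (0.1, 1.0, 10.0, 100.0):
    print(t, boltzmann(vals, t), log_avg_exp(vals, t))
# Both operators move from the plain average (~0.5) toward the maximum (0.9)
# as the temperature-like parameter grows, which is the "between an average
# and a max" behavior described in the abstract.
```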
