Taming the Noise in Reinforcement Learning via Soft Updates

Model-free reinforcement learning algorithms such as Q-learning perform poorly in the early stages of learning in noisy environments, because much effort is spent on unlearning biased estimates of the state-action value function. The bias comes from selecting, among several noisy estimates, the apparent optimum, which may in fact be suboptimal. We propose G-learning, a new off-policy learning algorithm that regularizes the noise in the space of optimal actions by penalizing deterministic policies early in the learning process. It also naturally incorporates prior distributions over optimal actions, when available. The stochastic nature of G-learning further makes it more cost-effective than Q-learning in noiseless but exploration-risky domains. We illustrate these ideas in several examples where G-learning yields significant improvements in both the learning rate and the learning cost.
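
To make the soft-update idea concrete, below is a minimal tabular sketch (written here in reward rather than cost form, with a uniform action prior; all names, the beta schedule, and the toy dimensions are illustrative assumptions, not the paper's exact formulation). The next-state value is a prior-weighted log-sum-exp of G, so a small coefficient beta keeps the policy close to the prior and penalizes deterministic behavior early on, while a large beta approaches the hard max used by Q-learning.

```python
import numpy as np

def g_learning_update(G, s, a, r, s_next, rho, alpha, beta, gamma):
    """One soft (log-sum-exp) backup of the tabular G-function.

    G:   array of shape (n_states, n_actions)
    rho: prior policy, same shape, rows summing to 1
    """
    # Soft next-state value: (1/beta) * log sum_a' rho(a'|s') exp(beta * G(s', a'))
    x = beta * G[s_next] + np.log(rho[s_next])
    m = np.max(x)
    soft_value = (m + np.log(np.sum(np.exp(x - m)))) / beta  # numerically stable log-sum-exp
    target = r + gamma * soft_value
    G[s, a] += alpha * (target - G[s, a])
    return G

def soft_policy(G, s, rho, beta):
    """Stochastic policy: pi(a|s) proportional to rho(a|s) * exp(beta * G(s, a))."""
    logits = beta * G[s] + np.log(rho[s])
    p = np.exp(logits - np.max(logits))
    return p / p.sum()

# Toy usage: 5 states, 3 actions, uniform prior; beta would be annealed upward over time.
n_s, n_a = 5, 3
G = np.zeros((n_s, n_a))
rho = np.full((n_s, n_a), 1.0 / n_a)
G = g_learning_update(G, s=0, a=1, r=1.0, s_next=2, rho=rho,
                      alpha=0.1, beta=0.5, gamma=0.95)
print(soft_policy(G, s=0, rho=rho, beta=0.5))
```

In this sketch, annealing beta from small to large interpolates between following the prior (heavy regularization while value estimates are still noisy) and the greedy Q-learning backup once the estimates are trustworthy; the exact schedule used in the paper is not reproduced here.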
