Gaussian Approximation for Bias Reduction in Q-Learning

Off-policy temporal-difference algorithms are among the building blocks of reinforcement learning (RL). Within this family, Q-Learning is arguably the most famous, and it has been widely studied and extended. The Q-Learning update rule uses the maximum operator to estimate the maximum expected value of the return. This estimate, however, is positively biased and may hinder learning, especially in stochastic environments and when function approximation is used. We introduce the Weighted Estimator as an effective solution to mitigate the negative effects of overestimation in Q-Learning. The Weighted Estimator estimates the maximum expected value as a weighted sum of the action values, where the weight of each action value is the probability that it is the maximum. In this work, we study the problem from the statistical perspective of estimating the maximum expected value of a set of random variables, and we provide bounds on the bias and the variance of the Weighted Estimator, showing its advantages over other estimators in the literature. We then derive algorithms that use the Weighted Estimator, in place of the Maximum Estimator, in online and batch RL, and we introduce a novel algorithm for deep RL. Finally, we empirically evaluate our algorithms on a large set of heterogeneous problems, spanning discrete and continuous, low- and high-dimensional, deterministic and stochastic environments. The experimental results show that the Weighted Estimator effectively controls the bias of the estimate, yielding better performance than representative baselines and robust learning across a diverse set of environments.
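The abstract describes the estimator only informally. As a concrete illustration, the following is a minimal sketch of a Weighted Estimator under a Gaussian approximation of each sample mean: every action value is treated as Gaussian with the given mean and standard error, the weight of each action is the probability that it is the largest (computed by numerical integration), and the estimate is the weight-averaged mean. The function name `weighted_estimator`, the `means`/`std_errs` interface, and the integration grid are assumptions for this sketch, not the paper's implementation.

```python
import numpy as np
from scipy.stats import norm

def weighted_estimator(means, std_errs, num_points=2000):
    """Sketch: estimate max_i E[X_i] as a weighted sum of sample means,
    where the weight of X_i is the probability that it is the maximum
    under a Gaussian approximation of each sample mean.
    `std_errs` must be strictly positive."""
    means = np.asarray(means, dtype=float)
    std_errs = np.asarray(std_errs, dtype=float)

    # Integration grid wide enough to cover every Gaussian.
    lo = float(np.min(means - 8.0 * std_errs))
    hi = float(np.max(means + 8.0 * std_errs))
    x = np.linspace(lo, hi, num_points)
    dx = x[1] - x[0]

    # CDF of every Gaussian on the grid: shape (n_actions, num_points).
    cdfs = norm.cdf(x[None, :], loc=means[:, None], scale=std_errs[:, None])

    weights = np.empty_like(means)
    for i in range(len(means)):
        pdf_i = norm.pdf(x, loc=means[i], scale=std_errs[i])
        # Product of the CDFs of all the other actions.
        others = np.prod(np.delete(cdfs, i, axis=0), axis=0)
        # w_i = P(X_i is the largest) under the Gaussian approximation.
        weights[i] = np.sum(pdf_i * others) * dx

    weights /= weights.sum()  # absorb numerical-integration error
    return float(np.dot(weights, means))
```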

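Under the same assumptions, the sketch below shows how such an estimator could replace the max operator in a tabular Q-Learning target. It reuses `weighted_estimator` from the block above; the per-state-action standard-error table `sigma` is hypothetical and would have to be maintained by the learning algorithm (for example, from a running estimate of the target's variance and the visit counts).

```python
def weighted_q_learning_step(Q, sigma, s, a, r, s_next, done,
                             alpha=0.1, gamma=0.99):
    """Sketch of one tabular update in which the Weighted Estimator
    replaces the max operator in the bootstrapped target.
    Q and sigma are assumed to be arrays indexed by [state, action]."""
    if done:
        target = r
    else:
        # Weighted-Estimator target instead of r + gamma * max_a' Q[s_next, a'].
        target = r + gamma * weighted_estimator(Q[s_next], sigma[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```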