Exploiting structure and uncertainty of Bellman updates in Markov decision processes

In many real-world problems, stochasticity is a critical issue for the learning process. It may come from the transition model, from the explorative component of the policy, or, even worse, from noisy observations of the reward function. With a finite number of samples, traditional Reinforcement Learning (RL) methods yield biased estimates of the action-value function; these poor estimates are then propagated by repeated application of the Bellman operator. While some approaches assume that this estimation bias is the key problem in the learning process, we show that in some cases this assumption does not hold. We propose a method that exploits the structure of the Bellman update and the uncertainty of the estimates in order to make better use of the information provided by the samples. We provide theoretical considerations about this method and its relation to Q-Learning. Moreover, we evaluate it on benchmark environments from the literature to demonstrate its effectiveness against other algorithms that focus on bias and sample efficiency.
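
To make the bias issue concrete, the sketch below contrasts a standard tabular Q-Learning backup, whose max over noisy action-value estimates is biased upwards, with a hypothetical uncertainty-penalized backup that shrinks the bootstrap target for rarely visited successor actions. This is only a minimal illustration of folding estimation uncertainty into the Bellman update, assuming a count-based confidence width; it is not the method proposed in the paper.

    import numpy as np

    def q_backup(Q, counts, s, a, r, s_next, alpha=0.1, gamma=0.99,
                 use_uncertainty=False):
        """One tabular Bellman backup.

        The plain branch is standard Q-Learning. The uncertainty-aware branch
        is a hypothetical variant (not the paper's algorithm): it subtracts a
        count-based confidence width from the successor values before taking
        the max, so rarely visited (high-uncertainty) actions contribute less
        to the bootstrap target.
        """
        if use_uncertainty:
            # Penalize actions whose value estimates rest on few samples.
            width = 1.0 / np.sqrt(np.maximum(counts[s_next], 1))
            target = r + gamma * np.max(Q[s_next] - width)
        else:
            # The max over noisy estimates overestimates the true maximum.
            target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        counts[s, a] += 1
        return Q

    # Toy usage on a 5-state, 2-action MDP (all quantities are illustrative).
    Q = np.zeros((5, 2))
    counts = np.zeros((5, 2))
    Q = q_backup(Q, counts, s=0, a=1, r=1.0, s_next=2, use_uncertainty=True)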
