Uncertainty propagation for quality assurance in Reinforcement Learning

In this paper we address the reliability of policies derived by Reinforcement Learning from a limited amount of observations. This can be done in a principled manner by taking into account the derived Q-function's uncertainty, which stems from the uncertainty of the estimators used for the MDP's transition probabilities and reward function. We apply uncertainty propagation in parallel to the Bellman iteration and obtain confidence intervals for the Q-function. In a second step we modify the Bellman operator so as to obtain a policy guaranteeing the highest minimum performance with a given probability. We demonstrate the functionality of our method on artificial examples and show that, for an important class of problems, even an improvement of the expected performance can be obtained. Finally, we verify this observation on an application to gas turbine control.
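To make the idea concrete, the following is a minimal sketch of how Gaussian uncertainty propagation can be interleaved with the Bellman iteration for a discrete MDP. It is not the paper's implementation: it assumes frequentist transition and reward estimators, ignores correlations between estimators (diagonal covariance only), and the function name `certain_optimal_q` and parameters such as `xi` (the confidence scaling of the pessimistic value) are hypothetical choices for illustration.

```python
import numpy as np

def certain_optimal_q(counts, reward_sum, reward_sqsum, gamma=0.9, xi=1.0, n_iter=200):
    """Bellman iteration with first-order (Gaussian) uncertainty propagation.

    counts[s, a, s']       : observed transition counts          (hypothetical inputs)
    reward_sum[s, a, s']   : sum of observed rewards per transition
    reward_sqsum[s, a, s'] : sum of squared rewards (for the reward variance)
    Returns Q, sigma_Q and the xi-pessimistic policy maximizing Q - xi * sigma_Q.
    """
    S, A, _ = counts.shape
    n_sa = counts.sum(axis=2, keepdims=True)            # visits of each (s, a)
    P = counts / np.maximum(n_sa, 1)                    # transition probability estimate
    var_P = P * (1.0 - P) / np.maximum(n_sa, 1)         # multinomial frequency variance
    R = reward_sum / np.maximum(counts, 1)              # mean reward per (s, a, s')
    var_R = (reward_sqsum / np.maximum(counts, 1) - R**2) / np.maximum(counts, 1)
    var_R = np.maximum(var_R, 0.0)

    Q = np.zeros((S, A))
    var_Q = np.zeros((S, A))
    for _ in range(n_iter):
        # pessimistic value used both as iteration target and for action selection
        pess = Q - xi * np.sqrt(var_Q)
        a_star = pess.argmax(axis=1)
        V = Q[np.arange(S), a_star]
        var_V = var_Q[np.arange(S), a_star]
        target = R + gamma * V[None, None, :]            # R(s,a,s') + gamma * V(s')
        Q = (P * target).sum(axis=2)
        # first-order propagation, treating P, R and V estimators as uncorrelated
        var_Q = (target**2 * var_P
                 + P**2 * var_R
                 + (gamma * P)**2 * var_V[None, None, :]).sum(axis=2)
    return Q, np.sqrt(var_Q), (Q - xi * np.sqrt(var_Q)).argmax(axis=1)
```

Setting `xi = 0` recovers the ordinary Bellman iteration, while larger `xi` trades expected performance for a higher guaranteed minimum performance, in the spirit of the percentile criterion described above.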
