PAC-Bayesian Policy Evaluation for Reinforcement Learning

Bayesian priors offer a compact yet general means of incorporating domain knowledge into many learning tasks. The validity of Bayesian analysis and inference, however, depends heavily on the accuracy of these priors. PAC-Bayesian methods overcome this problem by providing bounds that hold regardless of the correctness of the prior distribution. This paper introduces the first PAC-Bayesian bound for the batch reinforcement learning problem with function approximation. We show how this bound can be used to perform model selection in a transfer learning scenario. Our empirical results confirm that PAC-Bayesian policy evaluation is able to leverage prior distributions when they are informative and, unlike standard Bayesian RL approaches, to ignore them when they are misleading.
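To make the model-selection idea concrete, here is a minimal sketch of how a PAC-Bayesian bound can rank candidate priors: for each prior, compute the empirical risk of the posterior plus a KL complexity penalty, and pick the prior whose bound is tightest. This sketch assumes a generic McAllester-style bound for losses normalized to [0, 1] and diagonal-Gaussian priors and posteriors over linear value-function weights; the function names, the Gaussian families, and this particular bound form are illustrative assumptions, not the paper's actual theorem.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(Q || P) between diagonal Gaussians (closed form)."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def pac_bayes_bound(emp_risk, kl, n, delta=0.05):
    """McAllester-style bound on the true risk of the Gibbs predictor.

    Holds with probability >= 1 - delta over the n samples, for any
    prior P fixed before seeing the data -- even a misleading one.
    """
    return emp_risk + np.sqrt((kl + np.log(2.0 * np.sqrt(n) / delta)) / (2.0 * n))

def select_prior(priors, mu_q, var_q, emp_risk, n):
    """Pick the candidate prior whose PAC-Bayes bound on the
    (normalized) empirical Bellman error is tightest.

    priors: list of (mu_p, var_p) pairs, e.g. posteriors transferred
    from source tasks plus an uninformative fallback.
    """
    bounds = [
        pac_bayes_bound(emp_risk, kl_diag_gaussians(mu_q, var_q, mu_p, var_p), n)
        for (mu_p, var_p) in priors
    ]
    return int(np.argmin(bounds)), bounds
```

Because the bound is valid for every prior simultaneously (up to a union-bound correction over the candidates), an informative prior yields a small KL term and a tight bound, while a misleading prior inflates the KL term and is automatically passed over in favor of the fallback.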
