Bayesian Reinforcement Learning: A Survey

Bayesian methods for machine learning have been widely investigated, yielding principled methods for incorporating prior information into inference algorithms. In this survey, we provide an in-depth review of the role of Bayesian methods in the reinforcement learning (RL) paradigm. The major incentives for incorporating Bayesian reasoning in RL are that: 1) it provides an elegant approach to action selection (the exploration/exploitation trade-off) as a function of the uncertainty in learning; and 2) it provides machinery for incorporating prior knowledge into the algorithms. We first discuss models and methods for Bayesian inference in the simple single-step bandit model. We then review the extensive recent literature on Bayesian methods for model-based RL, where prior information can be expressed over the parameters of the Markov model. We also present Bayesian methods for model-free RL, where priors are expressed over the value function or policy class. The objective of the paper is to provide a comprehensive survey of Bayesian RL algorithms and of their theoretical and empirical properties.
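
The uncertainty-driven action selection mentioned above is easiest to see in the single-step bandit setting. Below is a minimal Python sketch of Thompson sampling for a Bernoulli bandit (cf. Thompson, 1933): each arm keeps a conjugate Beta posterior over its success probability, and the action is chosen by sampling once from each posterior. The arm means, horizon, and uniform priors are illustrative assumptions, not anything prescribed by the survey.

```python
# Minimal Thompson sampling sketch for a Bernoulli bandit.
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.7]      # hypothetical arm success probabilities
alpha = np.ones(len(true_means))  # Beta(1, 1) uniform prior per arm
beta = np.ones(len(true_means))

for t in range(1000):
    # Sample one plausible mean per arm from its Beta posterior ...
    sampled = rng.beta(alpha, beta)
    # ... and pull the arm whose sample is largest.
    arm = int(np.argmax(sampled))
    reward = rng.random() < true_means[arm]
    # Conjugate Beta-Bernoulli update of the chosen arm's posterior.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))  # posterior means concentrate on the best arm
```

Arms whose posteriors are still wide are sampled above their mean often enough to keep being tried, so exploration is driven directly by uncertainty rather than by a hand-tuned schedule.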
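For model-based RL, where the prior is over the parameters of the Markov model, the following is a minimal sketch of posterior sampling in the spirit of Strens (2000): maintain a Dirichlet posterior over each row of the transition matrix, sample a complete MDP from the posterior once per episode, solve it by value iteration, and act greedily in the sampled model. The toy sizes, known rewards, and once-per-episode resampling schedule are assumptions made for illustration only.

```python
# Minimal posterior-sampling (Bayesian model-based RL) sketch.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.95
rng = np.random.default_rng(1)

# Dirichlet(1, ..., 1) priors over next-state distributions, stored as
# transition counts for each (state, action) pair.
counts = np.ones((n_states, n_actions, n_states))
rewards = rng.random((n_states, n_actions))  # rewards assumed known here
true_P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def sample_and_solve():
    # Draw one plausible transition model from the posterior ...
    P = np.array([[rng.dirichlet(counts[s, a]) for a in range(n_actions)]
                  for s in range(n_states)])
    # ... and compute its optimal Q-function by value iteration.
    Q = np.zeros((n_states, n_actions))
    for _ in range(200):
        Q = rewards + gamma * P @ Q.max(axis=1)
    return Q

s = 0
for episode in range(50):
    Q = sample_and_solve()           # one posterior sample per episode
    for _ in range(20):
        a = int(np.argmax(Q[s]))     # greedy in the sampled MDP
        s_next = rng.choice(n_states, p=true_P[s, a])
        counts[s, a, s_next] += 1    # conjugate Dirichlet count update
        s = s_next
```

Because the spread of the sampled models shrinks as the transition counts grow, exploration decays naturally without an explicit exploration parameter.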
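For the model-free setting, where the prior is placed on the value function itself, a canonical construction is Gaussian process temporal-difference (GPTD) learning (Engel, Mannor, and Meir, 2003). The sketch below is the simplest deterministic-transition variant applied to one fixed recorded trajectory; the squared-exponential kernel, noise level, one-dimensional state space, and synthetic rewards are all illustrative assumptions.

```python
# Minimal GPTD sketch: GP prior on V, conditioned on observed rewards
# through the temporal-difference relation r_t = V(x_t) - gamma*V(x_{t+1}).
import numpy as np

gamma, sigma = 0.9, 0.1
rng = np.random.default_rng(2)

def kernel(a, b, length=0.5):
    # Squared-exponential kernel between two sets of 1-D states.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

# A hypothetical trajectory of states and the rewards observed along it.
states = rng.random(21)                     # x_0, ..., x_T with T = 20
rewards = np.sin(2 * np.pi * states[:-1])   # r_t observed when leaving x_t

# H encodes the TD relation linking consecutive values to each reward.
T = len(rewards)
H = np.zeros((T, T + 1))
H[np.arange(T), np.arange(T)] = 1.0
H[np.arange(T), np.arange(T) + 1] = -gamma

K = kernel(states, states)               # GP prior covariance over V
G = H @ K @ H.T + sigma**2 * np.eye(T)   # covariance of the observed rewards
w = np.linalg.solve(G, rewards)

# Posterior mean and variance of V at query states; the posterior variance
# is the value-function uncertainty that Bayesian RL exploits.
query = np.linspace(0, 1, 5)
k_star = kernel(query, states) @ H.T     # Cov(V(query), rewards)
v_mean = k_star @ w
v_var = kernel(query, query).diagonal() - np.einsum(
    'ij,jk,ik->i', k_star, np.linalg.inv(G), k_star)
print(np.c_[query, v_mean, v_var])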
