Bayesian Reinforcement Learning: A Survey

Bayesian methods for machine learning have been widely investigated, yielding principled methods for incorporating prior information into inference algorithms. In this survey, we provide an in-depth review of the role of Bayesian methods in the reinforcement learning (RL) paradigm. The major incentives for incorporating Bayesian reasoning in RL are that: 1) it provides an elegant approach to action selection (the exploration/exploitation trade-off) as a function of the uncertainty in learning; and 2) it provides machinery for incorporating prior knowledge into the algorithms. We first discuss models and methods for Bayesian inference in the simple single-step bandit model. We then review the extensive recent literature on Bayesian methods for model-based RL, where prior information can be expressed over the parameters of the Markov model. We also present Bayesian methods for model-free RL, where priors are expressed over the value function or policy class. The objective of the paper is to provide a comprehensive survey of Bayesian RL algorithms and of their theoretical and empirical properties.
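
The uncertainty-driven action selection mentioned above is easiest to see in the single-step bandit setting. Below is a minimal Python sketch of Thompson sampling for a Bernoulli bandit (cf. Thompson, 1933): each arm keeps a conjugate Beta posterior over its success probability, and the action is chosen by sampling once from each posterior. The arm means, horizon, and uniform priors are illustrative assumptions, not anything prescribed by the survey.

```python
# Minimal Thompson sampling sketch for a Bernoulli bandit.
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.7]      # hypothetical arm success probabilities
alpha = np.ones(len(true_means))  # Beta(1, 1) uniform prior per arm
beta = np.ones(len(true_means))

for t in range(1000):
    # Sample one plausible mean per arm from its Beta posterior ...
    sampled = rng.beta(alpha, beta)
    # ... and pull the arm whose sample is largest.
    arm = int(np.argmax(sampled))
    reward = rng.random() < true_means[arm]
    # Conjugate Beta-Bernoulli update of the chosen arm's posterior.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))  # posterior means concentrate on the best arm
```

Arms whose posteriors are still wide are sampled above their mean often enough to keep being tried, so exploration is driven directly by uncertainty rather than by a hand-tuned schedule.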
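For model-based RL, where the prior is over the parameters of the Markov model, the following is a minimal sketch of posterior sampling in the spirit of Strens (2000): maintain a Dirichlet posterior over each row of the transition matrix, sample a complete MDP from the posterior once per episode, solve it by value iteration, and act greedily in the sampled model. The toy sizes, known rewards, and once-per-episode resampling schedule are assumptions made for illustration only.

```python
# Minimal posterior-sampling (Bayesian model-based RL) sketch.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.95
rng = np.random.default_rng(1)

# Dirichlet(1, ..., 1) priors over next-state distributions, stored as
# transition counts for each (state, action) pair.
counts = np.ones((n_states, n_actions, n_states))
rewards = rng.random((n_states, n_actions))  # rewards assumed known here
true_P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def sample_and_solve():
    # Draw one plausible transition model from the posterior ...
    P = np.array([[rng.dirichlet(counts[s, a]) for a in range(n_actions)]
                  for s in range(n_states)])
    # ... and compute its optimal Q-function by value iteration.
    Q = np.zeros((n_states, n_actions))
    for _ in range(200):
        Q = rewards + gamma * P @ Q.max(axis=1)
    return Q

s = 0
for episode in range(50):
    Q = sample_and_solve()           # one posterior sample per episode
    for _ in range(20):
        a = int(np.argmax(Q[s]))     # greedy in the sampled MDP
        s_next = rng.choice(n_states, p=true_P[s, a])
        counts[s, a, s_next] += 1    # conjugate Dirichlet count update
        s = s_next
```

Because the spread of the sampled models shrinks as the transition counts grow, exploration decays naturally without an explicit exploration parameter.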
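For the model-free setting, where the prior is placed on the value function itself, a canonical construction is Gaussian process temporal-difference (GPTD) learning (Engel, Mannor, and Meir, 2003). The sketch below is the simplest deterministic-transition variant applied to one fixed recorded trajectory; the squared-exponential kernel, noise level, one-dimensional state space, and synthetic rewards are all illustrative assumptions.

```python
# Minimal GPTD sketch: GP prior on V, conditioned on observed rewards
# through the temporal-difference relation r_t = V(x_t) - gamma*V(x_{t+1}).
import numpy as np

gamma, sigma = 0.9, 0.1
rng = np.random.default_rng(2)

def kernel(a, b, length=0.5):
    # Squared-exponential kernel between two sets of 1-D states.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

# A hypothetical trajectory of states and the rewards observed along it.
states = rng.random(21)                     # x_0, ..., x_T with T = 20
rewards = np.sin(2 * np.pi * states[:-1])   # r_t observed when leaving x_t

# H encodes the TD relation linking consecutive values to each reward.
T = len(rewards)
H = np.zeros((T, T + 1))
H[np.arange(T), np.arange(T)] = 1.0
H[np.arange(T), np.arange(T) + 1] = -gamma

K = kernel(states, states)               # GP prior covariance over V
G = H @ K @ H.T + sigma**2 * np.eye(T)   # covariance of the observed rewards
w = np.linalg.solve(G, rewards)

# Posterior mean and variance of V at query states; the posterior variance
# is the value-function uncertainty that Bayesian RL exploits.
query = np.linspace(0, 1, 5)
k_star = kernel(query, states) @ H.T     # Cov(V(query), rewards)
v_mean = k_star @ w
v_var = kernel(query, query).diagonal() - np.einsum(
    'ij,jk,ik->i', k_star, np.linalg.inv(G), k_star)
print(np.c_[query, v_mean, v_var])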
