Bayes-Adaptive Simulation-based Search with Value Function Approximation

Bayes-adaptive planning offers a principled solution to the exploration-exploitation trade-off under model uncertainty: it finds the optimal policy in belief space, explicitly accounting for the expected effect of uncertainty reduction on future rewards. However, the Bayes-adaptive solution is typically intractable in domains with large or continuous state spaces. We present a tractable approximation that combines simulation-based search with a novel value function approximation technique that generalises appropriately over belief space. Our method outperforms prior approaches on both discrete bandit tasks and simple continuous navigation and control tasks.
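
To make the general recipe concrete, the sketch below shows one way simulation-based search and a value function over belief space can fit together on a two-armed Bernoulli bandit with Beta posteriors. It is an illustrative assumption-laden sketch, not the paper's exact algorithm: the feature map, step sizes, rollout policy, and helper names (features, lookahead, simulate, choose_arm) are all choices made here for exposition. Samples from the posterior are drawn once per simulation at the root, simulated outcomes update the belief counts as if they were real observations, and a linear value function over belief features is fitted from simulated returns and used to guide the rollout policy.

```python
"""Illustrative sketch: Bayes-adaptive Monte-Carlo search on a two-armed
Bernoulli bandit with a linear value function over belief features.
Feature choices, constants, and helper names are assumptions for exposition."""
import numpy as np

HORIZON = 20   # remaining pulls to plan for
N_SIMS = 2000  # simulated episodes per decision
GAMMA = 1.0    # undiscounted finite horizon

def features(counts):
    """Belief features that generalise across posteriors:
    posterior means and a crude uncertainty term per arm, plus a bias."""
    a, b = counts[:, 0], counts[:, 1]
    return np.concatenate(([1.0], a / (a + b), 1.0 / (a + b)))

def lookahead(counts, w):
    """One-step lookahead in belief space with the learned value function:
    for each arm, average the value of the two possible posterior updates
    under the posterior-predictive probability of success."""
    q = np.zeros(2)
    for arm in range(2):
        p = counts[arm, 0] / counts[arm].sum()  # predictive P(success)
        for r, prob in ((1, p), (0, 1.0 - p)):
            nxt = counts.copy()
            nxt[arm, 0 if r else 1] += 1
            q[arm] += prob * (r + GAMMA * (features(nxt) @ w))
    return q

def simulate(counts, theta, rng, w):
    """Run one simulated episode from the current belief.
    Root sampling: arm probabilities `theta` were drawn once from the
    posterior; belief counts are updated as if outcomes were observed."""
    counts = counts.copy()
    ret, visited = 0.0, []
    for t in range(HORIZON):
        phi = features(counts)
        # epsilon-greedy rollout policy guided by the value approximator
        if rng.random() < 0.1:
            arm = int(rng.integers(2))
        else:
            arm = int(np.argmax(lookahead(counts, w)))
        r = float(rng.random() < theta[arm])
        counts[arm, 0 if r else 1] += 1  # Bayesian belief update
        visited.append((phi, r))
        ret += (GAMMA ** t) * r
    # Monte-Carlo regression of the belief-space value function
    g = 0.0
    for phi, r in reversed(visited):
        g = r + GAMMA * g
        w += 0.01 * (g - phi @ w) * phi  # stochastic gradient step (in place)
    return ret

def choose_arm(counts, rng):
    """Compare simulated Bayes-adaptive returns of each first action,
    sharing a single value approximator across all simulations."""
    w = np.zeros(5)
    returns, pulls = np.zeros(2), np.zeros(2)
    for _ in range(N_SIMS):
        theta = rng.beta(counts[:, 0], counts[:, 1])  # posterior sample
        first = int(rng.integers(2))
        r = float(rng.random() < theta[first])
        nxt = counts.copy()
        nxt[first, 0 if r else 1] += 1
        returns[first] += r + GAMMA * simulate(nxt, theta, rng, w)
        pulls[first] += 1
    return int(np.argmax(returns / np.maximum(pulls, 1)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = np.ones((2, 2))  # uniform Beta(1, 1) priors on both arms
    print("chosen arm:", choose_arm(counts, rng))
```

Because the value function is defined over belief features rather than individual histories, information gathered in one simulated belief state generalises to nearby beliefs; that is the property the sketch is meant to illustrate, and the paper develops a more principled version of it for large and continuous state spaces.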
