Potential-Based Reward Shaping for POMDPs (Extended Abstract)
Adam Eck | Leen-Kiat Soh | Sam Devlin | Daniel Kudenko