Approximate planning for Bayesian hierarchical reinforcement learning

In this paper, we propose to use hierarchical action decomposition to make Bayesian model-based reinforcement learning more efficient and feasible for larger problems. We formulate Bayesian hierarchical reinforcement learning as a partially observable semi-Markov decision process (POSMDP). The main POSMDP task is partitioned into a hierarchy of POSMDP subtasks. Each subtask may consist only of primitive actions, or it may hierarchically invoke the policies of other subtasks, since the policies of lower-level subtasks serve as macro-actions in higher-level subtasks. The hierarchy is therefore solved bottom-up: lower-level subtasks first, then higher-level ones. Because each formulated POSMDP has a continuous state space, we sample from a prior belief to build an approximate model of each subtask and then solve it with the recently introduced Monte Carlo Value Iteration with Macro-Actions solver. We call this method Monte Carlo Bayesian Hierarchical Reinforcement Learning. Simulation results show that our algorithm, by exploiting the action hierarchy, performs significantly better than flat Bayesian reinforcement learning in terms of both reward and, especially, solving time, which improves by at least one order of magnitude.
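The bottom-up solution order is the structural core of the method and is easy to sketch. The Python fragment below is a minimal illustration under assumptions, not the paper's actual MCVI-MA implementation: the names Subtask, sample_models, and solve_bottom_up are hypothetical, and the caller-supplied prior and solver callables stand in for the prior belief over models and the macro-action POSMDP solver described above.

```python
class Subtask:
    """A node in the action hierarchy. Its available actions are either
    primitive actions or the (already solved) policies of child subtasks."""

    def __init__(self, name, primitive_actions, children=()):
        self.name = name
        self.primitive_actions = list(primitive_actions)
        self.children = list(children)
        self.policy = None  # filled in once this subtask is solved

    def macro_actions(self):
        # A solved child's policy acts as a single macro-action here.
        return self.primitive_actions + [c.policy for c in self.children]


def sample_models(prior, n):
    """Draw n model samples from the prior belief; each sample is a fully
    specified model the solver can plan against (hypothetical interface)."""
    return [prior() for _ in range(n)]


def solve_bottom_up(task, prior, solver, n_samples=100):
    """Solve the hierarchy leaves-first, so that every parent subtask can
    treat its children's policies as macro-actions."""
    for child in task.children:
        solve_bottom_up(child, prior, solver, n_samples)
    models = sample_models(prior, n_samples)
    task.policy = solver(models, task.macro_actions())
    return task.policy


# Example wiring (hypothetical tasks): "navigate" is solved before
# "deliver", so "deliver" can invoke the navigation policy as one action.
navigate = Subtask("navigate", ["north", "south", "east", "west"])
deliver = Subtask("deliver", ["pickup", "putdown"], children=[navigate])
```

The design point the sketch captures is that once a lower-level policy is fixed, the higher-level problem shrinks: the parent plans over a handful of macro-actions rather than over every primitive action sequence, which is where the order-of-magnitude reduction in solving time comes from.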
