Approximate planning for Bayesian hierarchical reinforcement learning

In this paper, we propose to use hierarchical action decomposition to make Bayesian model-based reinforcement learning more efficient and feasible for larger problems. We formulate Bayesian hierarchical reinforcement learning as a partially observable semi-Markov decision process (POSMDP). The main POSMDP task is partitioned into a hierarchy of POSMDP subtasks. Each subtask may consist only of primitive actions, or it may hierarchically invoke the policies of other subtasks, since the policies of lower-level subtasks serve as macro-actions in higher-level subtasks. The hierarchy is therefore solved bottom-up: lower-level subtasks first, then higher-level ones. Because each formulated POSMDP has a continuous state space, we sample from a prior belief to build an approximate model of each subtask and then solve it with the recently introduced Monte Carlo Value Iteration with Macro-Actions solver. We call this method Monte Carlo Bayesian Hierarchical Reinforcement Learning. Simulation results show that our algorithm, by exploiting the action hierarchy, performs significantly better than flat Bayesian reinforcement learning in terms of both reward and, especially, solving time, which improves by at least one order of magnitude.
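The bottom-up solution order is the structural core of the method and is easy to sketch. The Python fragment below is a minimal illustration under assumptions, not the paper's actual MCVI-MA implementation: the names Subtask, sample_models, and solve_bottom_up are hypothetical, and the caller-supplied prior and solver callables stand in for the prior belief over models and the macro-action POSMDP solver described above.

```python
class Subtask:
    """A node in the action hierarchy. Its available actions are either
    primitive actions or the (already solved) policies of child subtasks."""

    def __init__(self, name, primitive_actions, children=()):
        self.name = name
        self.primitive_actions = list(primitive_actions)
        self.children = list(children)
        self.policy = None  # filled in once this subtask is solved

    def macro_actions(self):
        # A solved child's policy acts as a single macro-action here.
        return self.primitive_actions + [c.policy for c in self.children]


def sample_models(prior, n):
    """Draw n model samples from the prior belief; each sample is a fully
    specified model the solver can plan against (hypothetical interface)."""
    return [prior() for _ in range(n)]


def solve_bottom_up(task, prior, solver, n_samples=100):
    """Solve the hierarchy leaves-first, so that every parent subtask can
    treat its children's policies as macro-actions."""
    for child in task.children:
        solve_bottom_up(child, prior, solver, n_samples)
    models = sample_models(prior, n_samples)
    task.policy = solver(models, task.macro_actions())
    return task.policy


# Example wiring (hypothetical tasks): "navigate" is solved before
# "deliver", so "deliver" can invoke the navigation policy as one action.
navigate = Subtask("navigate", ["north", "south", "east", "west"])
deliver = Subtask("deliver", ["pickup", "putdown"], children=[navigate])
```

The design point the sketch captures is that once a lower-level policy is fixed, the higher-level problem shrinks: the parent plans over a handful of macro-actions rather than over every primitive action sequence, which is where the order-of-magnitude reduction in solving time comes from.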
