Bayes-adaptive hierarchical MDPs

Reinforcement learning (RL) is an area of machine learning concerned with how an agent learns to make sequential decisions in order to optimize a given performance measure. To achieve this goal, the agent must trade off between 1) exploiting previously acquired knowledge, which may lead only to locally optimal behavior, and 2) exploring to gather new knowledge that is expected to improve current performance. Among RL algorithms, Bayesian model-based RL (BRL) is well known for trading off exploration and exploitation optimally via belief planning, i.e. by solving an equivalent partially observable Markov decision process (POMDP). However, solving this POMDP suffers from the curse of dimensionality and the curse of history. In this paper, we make two major contributions: 1) a framework that integrates temporal abstraction into BRL, resulting in a hierarchical POMDP formulation that can be solved online with a hierarchical sample-based planner; and 2) a subgoal discovery method for hierarchical BRL that automatically discovers useful macro-actions to accelerate learning. In the experiments, we demonstrate that the proposed approach scales up to much larger problems and that the agent discovers useful subgoals that speed up Bayesian reinforcement learning. A minimal illustrative sketch of the Bayes-adaptive belief representation underlying such belief planning is given below.
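The sketch below is not the paper's algorithm; it only illustrates, under assumed names and a toy discrete setting, the Bayes-adaptive belief that belief planning operates on: Dirichlet counts over unknown transition probabilities, updated from observed transitions, from which candidate MDPs can be root-sampled for Monte-Carlo planning.

```python
import numpy as np

class DirichletBelief:
    """Toy Bayes-adaptive belief over an unknown discrete MDP (illustrative only)."""

    def __init__(self, n_states, n_actions, prior=1.0):
        # counts[s, a, s'] are Dirichlet parameters for P(s' | s, a)
        self.counts = np.full((n_states, n_actions, n_states), prior)

    def update(self, s, a, s_next):
        # Bayesian posterior update after observing one transition (s, a, s')
        self.counts[s, a, s_next] += 1.0

    def sample_mdp(self, rng):
        # Root-sample one transition model from the posterior; sample-based
        # belief planners average simulated returns over such sampled MDPs
        flat = self.counts.reshape(-1, self.counts.shape[-1])
        probs = np.array([rng.dirichlet(row) for row in flat])
        return probs.reshape(self.counts.shape)

# Hypothetical usage on a 4-state, 2-action problem
rng = np.random.default_rng(0)
belief = DirichletBelief(n_states=4, n_actions=2)
belief.update(s=0, a=1, s_next=2)
model = belief.sample_mdp(rng)   # one plausible MDP consistent with the belief
print(model[0, 1])               # sampled next-state distribution for (s=0, a=1)
```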
