Model-Based Active Learning in Hierarchical Policies

Hierarchical task decompositions play an essential role in the design of complex simulation and decision systems, such as those that arise in video games. Game designers find it natural to adopt a divide-and-conquer philosophy when specifying hierarchical policies, where decision modules can be constructed somewhat independently. Choosing the parameters of these modules by hand, however, is typically a lengthy and tedious process. The hierarchical reinforcement learning (HRL) field has produced elegant ways of decomposing policies and value functions using semi-Markov decision processes, but demonstrations in larger nonlinear systems with both discrete and continuous variables are still lacking. To narrow this gap between industrial practice and academic ideas, we address the problem of designing efficient algorithms that facilitate the deployment of HRL in more realistic settings. In particular, we propose Bayesian active learning methods that learn the relevant aspects of either policies or value functions by focusing on the most relevant parts of the parameter and state spaces, respectively. To demonstrate the scalability of our solution, we have applied it to The Open Racing Car Simulator (TORCS), a 3D game engine that implements complex vehicle dynamics. The environment is a large topological map roughly based on downtown Vancouver, British Columbia.
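To make the Bayesian active learning idea concrete, the sketch below shows one common way such a loop can be set up: a Gaussian-process surrogate is fit to the returns of previously evaluated policy parameters, and an expected-improvement acquisition picks the next parameter to simulate. This is a minimal illustration only; the toy simulator, the 1-D parameter, the RBF kernel width, and the noise level are all assumptions for exposition and do not reproduce the paper's actual TORCS setup or hierarchical decomposition.

```python
# Minimal sketch of Bayesian active policy search over a single policy
# parameter. All specifics (simulator, kernel, noise) are illustrative
# assumptions, not the paper's implementation.
import numpy as np
from scipy.stats import norm

def simulate_return(theta):
    """Stand-in for an expensive, noisy policy rollout (e.g. one racing episode)."""
    return -(theta - 0.3) ** 2 + 0.05 * np.random.randn()

def rbf_kernel(a, b, length=0.2):
    """Squared-exponential kernel between two 1-D arrays of parameters."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(X, y, Xs, noise=1e-3):
    """GP posterior mean and std at query points Xs given observations (X, y)."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    Kss = rbf_kernel(Xs, Xs)
    alpha = np.linalg.solve(K, y)
    mu = Ks.T @ alpha
    v = np.linalg.solve(K, Ks)
    var = np.clip(np.diag(Kss - Ks.T @ v), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, best):
    """Expected improvement over the best return observed so far (maximization)."""
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

# Active learning loop: spend simulations only where they are most informative.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=3)             # initial policy parameters tried
y = np.array([simulate_return(t) for t in X]) # their (noisy) returns
grid = np.linspace(0.0, 1.0, 200)             # candidate parameter settings

for _ in range(10):
    mu, sigma = gp_posterior(X, y, grid)
    ei = expected_improvement(mu, sigma, y.max())
    theta_next = grid[np.argmax(ei)]          # most promising parameter to test next
    X = np.append(X, theta_next)
    y = np.append(y, simulate_return(theta_next))

print("best parameter found:", X[np.argmax(y)], "with return:", y.max())
```

In a hierarchical policy, a loop of this form would be run per decision module (or over the joint parameters of a few modules), so that each expensive simulation is targeted at the parameters whose effect on the return is currently most uncertain or most promising.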
