Bayesian exploration and interactive demonstration in continuous state MAXQ-learning

Deploying robots for service tasks requires learning algorithms that scale to the combinatorial complexity of everyday environments. Inspired by the way humans decompose complex tasks, hierarchical methods for robot learning have attracted significant interest. In this paper, we apply the MAXQ method for hierarchical reinforcement learning to continuous state spaces. By using Gaussian Process Regression for the MAXQ value function decomposition, we obtain probabilistic estimates of the primitive and completion values of every subtask in the MAXQ hierarchy. From these, we recursively compute probabilistic estimates of state-action values. Based on the expected deviation of these estimates, we devise a Bayesian exploration strategy that balances the optimization of expected values against the risk of exploring unknown actions. To further reduce risk and to accelerate learning, we complement MAXQ with interactive learning from demonstrations: in every situation and subtask, the system may request a demonstration if it lacks the knowledge to determine a safe exploratory action. We demonstrate that the proposed system efficiently learns solutions to complex tasks in a box-stacking scenario.
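To make the approach concrete, the sketch below gives one plausible reading of the abstract in Python; it is not the authors' implementation. Each MAXQ subtask keeps Gaussian Process models of its primitive value V(a, s) and completion value C(i, s, a), recursively combines them into a probabilistic estimate of Q(i, s, a) = V(a, s) + C(i, s, a), and uses the propagated deviation both to score actions in a risk-aware way and to decide when to ask for a demonstration. The names Subtask, explore_or_ask, the independence assumption when combining deviations, and the parameters kappa and demo_threshold are illustrative assumptions; scikit-learn's GaussianProcessRegressor stands in for the paper's GP value models.

```python
# Illustrative sketch only -- NOT the authors' implementation. It assumes a MAXQ
# hierarchy whose leaves are primitive actions, and uses scikit-learn's
# GaussianProcessRegressor as a stand-in for the paper's GP value models.
# Subtask, explore_or_ask, kappa and demo_threshold are hypothetical names.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor


class Subtask:
    """A node of the MAXQ hierarchy with GP models of its value terms."""

    def __init__(self, children=()):
        self.children = list(children)                    # child subtasks; empty => primitive action
        self.value_gp = GaussianProcessRegressor()        # V(a, s) for primitive actions
        self.completion_gp = GaussianProcessRegressor()   # C(i, s, a): value after child a completes

    def q_estimate(self, s):
        """Mean and deviation of Q(i, s, a) = V(a, s) + C(i, s, a) for each child a."""
        s = np.asarray(s, dtype=float)
        means, devs = [], []
        for a, child in enumerate(self.children):
            if child.children:                             # composite child: recurse into its hierarchy
                v_mu, v_sd = child.best_value(s)
            else:                                          # primitive child: GP over its immediate value
                mu, sd = child.value_gp.predict(np.atleast_2d(s), return_std=True)
                v_mu, v_sd = mu[0], sd[0]
            x = np.concatenate([s, [float(a)]]).reshape(1, -1)
            c_mu, c_sd = self.completion_gp.predict(x, return_std=True)
            means.append(v_mu + c_mu[0])
            devs.append(np.hypot(v_sd, c_sd[0]))           # combine deviations assuming independence
        return np.array(means), np.array(devs)

    def best_value(self, s):
        """V(i, s) under the greedy policy, with the deviation of the chosen branch."""
        mu, sd = self.q_estimate(s)
        best = int(np.argmax(mu))
        return mu[best], sd[best]


def explore_or_ask(task, s, kappa=1.0, demo_threshold=2.0):
    """Choose a child action by trading expected value against estimated risk,
    or request a demonstration when every action is too uncertain."""
    mu, sd = task.q_estimate(s)
    if np.min(sd) > demo_threshold:                        # no sufficiently well-known action
        return "request_demonstration"
    score = mu - kappa * sd                                # risk-averse lower bound (assumed form)
    return int(np.argmax(score))
```

In this sketch a primitive action is Subtask() and a composite task is Subtask([child_1, ..., child_k]); explore_or_ask either returns the index of the child to execute or signals that a demonstration should be requested. Refitting the value and completion GPs on observed returns after each executed transition is omitted.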
