Sample and Feedback Efficient Hierarchical Reinforcement Learning from Human Preferences

While reinforcement learning has led to promising results in robotics, defining an informative reward function is challenging. Prior work has therefore included the human in the loop to jointly learn the reward function and the optimal policy. Generating samples on a physical robot and requesting human feedback are both taxing efforts for which efficiency is critical. We propose to learn reward functions from both the robot and the human perspective to improve both sample and feedback efficiency. Learning a reward function from the human perspective increases feedback efficiency by assuming that humans rank trajectories according to a low-dimensional outcome space. Learning a reward function from the robot perspective circumvents the need for a dynamics model while retaining the sample efficiency of model-based approaches. We provide an algorithm that incorporates bi-perspective reward learning into a general hierarchical reinforcement learning framework and demonstrate its merits on a toy task and a simulated robot grasping task.
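For intuition, the sketch below illustrates the human-perspective component described above: fitting a reward function over a low-dimensional outcome space from pairwise trajectory preferences. It is a minimal illustration only, not the paper's implementation; the linear reward model, the Bradley-Terry (logistic) preference likelihood, the gradient-descent fit, and all variable names are assumptions made for this example (richer models such as Gaussian-process preference learning could be substituted).

```python
import numpy as np

# Illustrative sketch (assumed, not the paper's method): learn a linear reward
# r(o) = w . o over a low-dimensional outcome vector o from pairwise human
# preferences, using a Bradley-Terry (logistic) likelihood.

def preference_nll(w, outcomes_a, outcomes_b, prefs):
    """Negative log-likelihood of the preference labels.
    prefs[i] = 1 if trajectory a_i was preferred over b_i, else 0."""
    diff = (outcomes_a - outcomes_b) @ w            # r(a) - r(b)
    p_a = 1.0 / (1.0 + np.exp(-diff))               # P(a preferred over b)
    eps = 1e-9
    return -np.mean(prefs * np.log(p_a + eps) + (1 - prefs) * np.log(1 - p_a + eps))

def fit_reward(outcomes_a, outcomes_b, prefs, lr=0.1, steps=500):
    """Fit the reward weights by gradient descent on the Bradley-Terry loss."""
    w = np.zeros(outcomes_a.shape[1])
    for _ in range(steps):
        diff = (outcomes_a - outcomes_b) @ w
        p_a = 1.0 / (1.0 + np.exp(-diff))
        grad = -((prefs - p_a)[:, None] * (outcomes_a - outcomes_b)).mean(axis=0)
        w -= lr * grad
    return w

# Usage: 20 synthetic preference queries over 3-dimensional outcomes.
rng = np.random.default_rng(0)
oa, ob = rng.normal(size=(20, 3)), rng.normal(size=(20, 3))
true_w = np.array([1.0, -0.5, 0.2])                 # hypothetical ground-truth weights
labels = (oa @ true_w > ob @ true_w).astype(float)  # simulated human answers
w_hat = fit_reward(oa, ob, labels)
print("recovered reward weights:", w_hat)
```

Because each query only asks the human to compare two outcomes rather than assign a numeric score, the number of queries needed to pin down the reward grows with the dimensionality of the outcome space, which is why a low-dimensional outcome representation helps feedback efficiency.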
