Simplifying Reward Design through Divide-and-Conquer

Designing a good reward function is essential to robot planning and reinforcement learning, but it can also be challenging and frustrating. The reward must work well across multiple environments, which often requires many iterations of tuning. We introduce a novel divide-and-conquer approach that enables the designer to specify a reward separately for each environment. By treating these separate reward functions as observations of the underlying true reward, we derive a method for inferring a common reward across all environments. We conduct user studies, measuring user effort and solution quality, in an abstract grid world domain and in a motion planning domain for a 7-DOF manipulator. We show that our method is faster and easier to use than the typical approach of designing a reward jointly across all environments, and that it produces higher-quality solutions. We additionally run a series of experiments measuring the sensitivity of these results to properties of the reward design task, such as the number of environments, the number of feasible solutions per environment, and the fraction of features that vary within each environment. We find that independent reward design outperforms the standard joint reward design process, and that it works best when the design problem can be divided into simpler subproblems.
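
The inference step described above can be made concrete with a small sketch. The following is a minimal, illustrative Python example, assuming rewards that are linear in trajectory features and an Inverse Reward Design-style observation model in which each independently designed proxy reward is treated as evidence about the true reward. The candidate-grid inference, feature dimensions, and variable names are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: 3 environments, rewards linear in 4 features.
n_features = 4
n_envs = 3
# Candidate trajectories per environment, summarized by feature counts.
env_traj_features = [rng.normal(size=(20, n_features)) for _ in range(n_envs)]
# Finite grid of candidate true-reward weight vectors (unit vectors).
candidates = rng.normal(size=(200, n_features))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)

def induced_features(w, traj_features):
    """Feature counts of the trajectory that is optimal for w in this environment."""
    return traj_features[np.argmax(traj_features @ w)]

def log_obs_model(w_true, w_proxy, traj_features, beta=5.0):
    """log P(w_proxy | w_true, env): a proxy reward is a likely 'observation'
    if the behavior it induces scores highly under the true reward.
    Normalized over the finite candidate set (a crude stand-in for the
    intractable normalizer in the full model)."""
    score = beta * (w_true @ induced_features(w_proxy, traj_features))
    all_scores = beta * np.array(
        [w_true @ induced_features(c, traj_features) for c in candidates]
    )
    m = all_scores.max()
    return score - (m + np.log(np.sum(np.exp(all_scores - m))))

def posterior_over_candidates(proxies, beta=5.0):
    """Posterior over candidate true rewards given one proxy reward per environment."""
    logp = np.zeros(len(candidates))
    for i, w_true in enumerate(candidates):
        logp[i] = sum(
            log_obs_model(w_true, w_proxy, feats, beta)
            for w_proxy, feats in zip(proxies, env_traj_features)
        )
    logp -= logp.max()
    p = np.exp(logp)
    return p / p.sum()

# The designer specifies a (possibly imperfect) proxy reward for each environment.
proxies = [candidates[rng.integers(len(candidates))] for _ in range(n_envs)]
post = posterior_over_candidates(proxies)
w_inferred = (post[:, None] * candidates).sum(axis=0)  # posterior mean reward
print("inferred common reward weights:", np.round(w_inferred, 2))
```

The design choice mirrored here is that each environment contributes an independent likelihood term, so the combined posterior concentrates on reward weights that would have induced good behavior in every environment, even though each proxy was tuned in isolation.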
