Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences

Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to learn reward functions directly from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has applied reward learning to these data sources independently; however, there are many domains where multiple sources are complementary and expressive. Motivated by this general problem, we present a framework that integrates multiple sources of information, collected either passively or actively from human users. Specifically, we present an algorithm that first uses user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero in on their true reward. This algorithm not only enables us to combine multiple data sources, but also informs the robot when it should leverage each type of information. Further, our approach accounts for the human's ability to provide data, yielding user-friendly preference queries that are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and usability of our integrated framework.
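To make the two-stage pipeline described above concrete, the following is a minimal sketch, assuming a linear reward over hand-crafted features, a Boltzmann-rational demonstration model, a Bradley-Terry style preference model, and information-gain query selection; the feature dimension, rationality parameters, and query-generation scheme are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

# Sketch of the demonstrations-then-preferences pipeline: initialize a belief over
# reward weights from a demonstration, then actively query pairwise preferences.
rng = np.random.default_rng(0)
D = 3             # number of reward features (assumed)
N_SAMPLES = 1000  # particles approximating the belief over reward weights w

def sample_unit_sphere(n, d):
    """Draw candidate reward weights uniformly from the unit sphere."""
    w = rng.normal(size=(n, d))
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def demo_log_likelihood(w, demo_features, beta=1.0):
    """Boltzmann-rational demonstration model: P(demo | w) ∝ exp(beta * w·Phi(demo))."""
    return beta * demo_features @ w.T  # unnormalized log-likelihood per particle

def preference_log_likelihood(w, phi_a, phi_b, answer, beta=1.0):
    """Softmax (Bradley-Terry) preference model for the query 'A or B?'."""
    delta = beta * (phi_a - phi_b) @ w.T
    # log sigmoid(delta) if A was chosen, log sigmoid(-delta) if B was chosen
    return -np.logaddexp(0.0, -delta) if answer == 0 else -np.logaddexp(0.0, delta)

# 1) Initialize the belief from a demonstration's feature counts (assumed given).
w_samples = sample_unit_sphere(N_SAMPLES, D)
demo_phi = np.array([0.8, -0.2, 0.4])          # hypothetical feature counts
log_w = demo_log_likelihood(w_samples, demo_phi)
weights = np.exp(log_w - log_w.max())
weights /= weights.sum()

# 2) Actively pick the preference query maximizing expected information gain
#    (mutual information between the user's answer and the reward weights).
def info_gain(phi_a, phi_b, w_samples, weights):
    gain = 0.0
    for ans in (0, 1):
        logp = preference_log_likelihood(w_samples, phi_a, phi_b, ans)
        p = np.exp(logp)
        p_ans = np.sum(weights * p)            # marginal probability of this answer
        if p_ans > 0:
            gain += np.sum(weights * p * (logp - np.log(p_ans)))
    return gain

candidate_queries = [(rng.normal(size=D), rng.normal(size=D)) for _ in range(50)]
best = max(candidate_queries, key=lambda q: info_gain(q[0], q[1], w_samples, weights))

# 3) After the (here simulated) user answers, update the belief by importance reweighting.
answer = 0  # pretend the user preferred trajectory A
weights *= np.exp(preference_log_likelihood(w_samples, best[0], best[1], answer))
weights /= weights.sum()

print("posterior mean reward weights:", weights @ w_samples)
```

In practice, steps 2 and 3 would be repeated for a fixed query budget or until the belief concentrates, and the candidate queries would come from trajectories in the robot's environment rather than random feature vectors.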
