Interpretable Preference-based Reinforcement Learning with Tree-Structured Reward Functions

The potential of reinforcement learning (RL) to deliver aligned and performant agents is partially bottlenecked by the reward engineering problem. One alternative to heuristic trial-and-error is preference-based RL (PbRL), where a reward function is inferred from sparse human feedback. However, prior PbRL methods lack interpretability of the learned reward structure, which hampers the ability to assess robustness and alignment. We propose an online, active preference learning algorithm that constructs reward functions with the intrinsically interpretable, compositional structure of a tree. Using both synthetic and human-provided feedback, we demonstrate sample-efficient learning of tree-structured reward functions in several environments, then harness the enhanced interpretability to explore and debug for alignment.

ACM Reference Format: Tom Bewley and Freddy Lecue. 2022. Interpretable Preference-based Reinforcement Learning with Tree-Structured Reward Functions. In Proc. of the 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2022), Online, May 9–13, 2022, IFAAMAS, 18 pages.
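To make the setup concrete, the following is a minimal sketch, not the authors' exact algorithm, of how a tree-structured reward function could be fit from pairwise trajectory preferences: latent trajectory returns are first estimated by least squares from the preference pairs, then a small regression tree is fit to per-transition reward targets. The synthetic trajectories, the hypothetical preference pairs, and the use of scikit-learn's DecisionTreeRegressor as the tree learner are all illustrative assumptions.

```python
# Minimal sketch (assumptions noted above) of learning a tree-structured
# reward function from pairwise trajectory preferences.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)

# Synthetic stand-in data: 20 trajectories of 10 transitions, 3 features each.
trajectories = [rng.normal(size=(10, 3)) for _ in range(20)]
# Hypothetical preference pairs: (preferred trajectory, dispreferred trajectory).
preferences = [(0, 1), (2, 1), (0, 3), (4, 2), (5, 0), (5, 4)]

# Step 1: least-squares estimate of a latent return G for each trajectory,
# so that G[win] - G[lose] is roughly 1 for every preference pair.
n = len(trajectories)
A = np.zeros((len(preferences) + 1, n))
b = np.zeros(len(preferences) + 1)
for row, (win, lose) in enumerate(preferences):
    A[row, win], A[row, lose], b[row] = 1.0, -1.0, 1.0
A[-1, :] = 1.0  # anchor the returns to sum to zero, fixing the scale's offset
G, *_ = np.linalg.lstsq(A, b, rcond=None)

# Step 2: spread each trajectory's estimated return over its transitions as
# per-step reward targets, then regress a small tree on transition features.
X = np.concatenate(trajectories)
y = np.concatenate([np.full(len(t), G[i] / len(t))
                    for i, t in enumerate(trajectories)])
tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)

# The fitted tree is the interpretable reward function: each leaf is an
# axis-aligned region of state-action space with a constant predicted reward.
print(export_text(tree, feature_names=["f0", "f1", "f2"]))
```

The constant-per-leaf structure is what makes such a reward function directly inspectable: the printed rules state which feature thresholds lead to high or low reward, which can then be checked against the intended task specification.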
