Understanding Learned Reward Functions

In many real-world tasks, it is not possible to procedurally specify an RL agent's reward function. In such cases, a reward function must instead be learned from interacting with and observing humans. However, current techniques for reward learning may fail to produce reward functions which accurately reflect user preferences. Absent significant advances in reward learning, it is thus important to be able to audit learned reward functions to verify whether they truly capture user preferences. In this paper, we investigate techniques for interpreting learned reward functions. In particular, we apply saliency methods to identify failure modes and predict the robustness of reward functions. We find that learned reward functions often implement surprising algorithms that rely on contingent aspects of the environment. We also discover that existing interpretability techniques often attend to irrelevant changes in reward output, suggesting that reward interpretability may need significantly different methods from policy interpretability.
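
The abstract refers to applying saliency methods to learned reward functions. As a rough illustration only (not the paper's specific procedure), the sketch below computes a plain gradient saliency map, |∂R/∂observation|, for a learned reward model over image observations; the `RewardNet` architecture and the `reward_saliency` helper are hypothetical names introduced here, not taken from the paper.

```python
# A minimal sketch, assuming a PyTorch reward model R(s) over image observations.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Hypothetical small CNN mapping an observation to a scalar reward."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(1)  # scalar reward output

    def forward(self, obs):
        return self.head(self.features(obs)).squeeze(-1)

def reward_saliency(reward_net, obs):
    """Return |d reward / d pixel|: a simple gradient saliency map."""
    obs = obs.clone().requires_grad_(True)
    reward_net(obs).sum().backward()
    # Collapse the channel dimension so the map has one value per pixel.
    return obs.grad.abs().max(dim=1).values

if __name__ == "__main__":
    net = RewardNet()
    frame = torch.rand(1, 3, 84, 84)   # stand-in for an Atari-style observation
    saliency = reward_saliency(net, frame)
    print(saliency.shape)              # torch.Size([1, 84, 84])
```

A map like this highlights which pixels most affect the predicted reward, which is the kind of evidence one could use to check whether a learned reward function attends to task-relevant features or to contingent aspects of the environment.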
