Explanation Augmented Feedback in Human-in-the-Loop Reinforcement Learning

Human-in-the-loop Reinforcement Learning (HRL) aims to integrate human guidance with Reinforcement Learning (RL) algorithms to improve sample efficiency and performance. Human guidance in HRL typically takes the form of a binary evaluative "good" or "bad" signal for queried states and actions. However, this form of supervision is weak and makes inefficient use of human feedback. To address this, we present EXPAND (Explanation Augmented Feedback), which allows the human to provide explanatory information as saliency maps in addition to the binary feedback. EXPAND employs a state-perturbation approach based on the salient regions of the state to augment the feedback, reducing the number of human feedback signals required. We evaluate the approach in two domains, Taxi and Atari Pong, and measure its effectiveness on three metrics: environment sample efficiency, human feedback sample efficiency, and agent gaze. We show that our method outperforms the baselines. Finally, we present an ablation study confirming our hypothesis that augmenting binary feedback with state-salient information yields a performance boost.
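To make the perturbation idea concrete, below is a minimal sketch (not the authors' published objective) of how a saliency-guided consistency term could be implemented: regions the human did not mark as relevant are Gaussian-blurred, and the agent's Q-values are encouraged to stay unchanged under such irrelevant perturbations. The names `q_network` and `blur_outside_mask`, the blur parameters, and the exact loss form are illustrative assumptions, not EXPAND's specification.

```python
# Hedged sketch of saliency-based feedback augmentation via state perturbation.
# Assumes a PyTorch Q-network and a human-provided binary saliency mask.
import random

import torch
import torch.nn.functional as F


def blur_outside_mask(state, saliency_mask, kernel_size=5, sigma=2.0):
    """Gaussian-blur the regions of `state` the human did NOT mark as salient.

    state:         (B, C, H, W) float tensor of stacked frames
    saliency_mask: (B, 1, H, W) binary tensor, 1 = human-marked salient pixel
    """
    # Build a 2-D Gaussian kernel and blur every channel of the full image.
    coords = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = (g / g.sum()).to(state.device)
    kernel = (g[:, None] * g[None, :]).expand(state.shape[1], 1, kernel_size, kernel_size)
    blurred = F.conv2d(state, kernel, padding=kernel_size // 2, groups=state.shape[1])
    # Keep human-marked salient pixels intact; replace everything else with blur.
    return saliency_mask * state + (1.0 - saliency_mask) * blurred


def explanation_consistency_loss(q_network, state, saliency_mask, n_perturbations=3):
    """Penalize changes in Q-values when only irrelevant regions are perturbed."""
    q_original = q_network(state).detach()
    loss = 0.0
    for _ in range(n_perturbations):
        sigma = random.uniform(1.0, 3.0)  # vary blur strength per perturbation
        perturbed = blur_outside_mask(state, saliency_mask, sigma=sigma)
        loss = loss + F.mse_loss(q_network(perturbed), q_original)
    return loss / n_perturbations
```

In a full training loop, a term like this would be added with some weight to the standard TD loss, so that the human's saliency annotation shapes the agent's representation alongside the binary "good"/"bad" feedback signal.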
