Avoidance Learning Using Observational Reinforcement Learning

Imitation learning seeks to recover an expert policy from sampled demonstrations. In the real world, however, a perfect expert is often unavailable, and avoiding dangerous behaviors becomes important for safety. We present the idea of \textit{learning to avoid}, in some sense the opposite objective to imitation learning, in which an agent learns to avoid a demonstrator's policy in a given environment. We define avoidance learning as optimizing the agent's reward while avoiding dangerous behaviors exhibited by a demonstrator. We develop a framework for avoidance learning by defining a suitable objective function that involves the \emph{distance} between the state occupancy distributions of the agent's policy and the demonstrator's policy. We estimate these state occupancy measures with density models and use the resulting distance as a reward bonus for avoiding the demonstrator. We validate our approach with experiments on a range of partially observable environments. The results show improved sample efficiency during training compared to state-of-the-art policy optimization and safety methods.
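The mechanism described above can be illustrated with a minimal sketch. The abstract does not specify the density model or the exact form of the bonus, so the following assumes a kernel density estimate of the demonstrator's state occupancy and a negative-log-density bonus; the names `avoidance_bonus` and `augmented_reward` are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical demonstrator states (2-D), used to fit a density
# estimate rho_D of the demonstrator's state occupancy measure.
rng = np.random.default_rng(0)
demo_states = rng.normal(loc=0.0, scale=0.5, size=(500, 2))
rho_D = gaussian_kde(demo_states.T)  # gaussian_kde expects shape (d, n)

def avoidance_bonus(state, coeff=1.0):
    """Bonus that grows as the agent's state moves away from
    regions the demonstrator occupies (low rho_D density)."""
    density = rho_D(np.asarray(state, dtype=float).reshape(2, 1))[0]
    return coeff * -np.log(density + 1e-8)

def augmented_reward(env_reward, state, coeff=1.0):
    # Total reward = task reward + avoidance bonus; the exact
    # combination used in the paper is an assumption here.
    return env_reward + avoidance_bonus(state, coeff)

# A state near the demonstrator's occupancy earns a smaller bonus
# than a state far from anything the demonstrator visits.
near = avoidance_bonus([0.0, 0.0])
far = avoidance_bonus([3.0, 3.0])
```

In practice the density model would be fit on demonstrator trajectories collected in the actual environment, and the bonus coefficient would be tuned so that avoidance does not dominate the task reward.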
