Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences

Bayesian reward learning from demonstrations enables rigorous safety and uncertainty analysis when performing imitation learning. However, Bayesian reward learning methods are typically computationally intractable for complex control problems. We propose Bayesian Reward Extrapolation (Bayesian REX), a highly efficient Bayesian reward learning algorithm that scales to high-dimensional imitation learning problems by pre-training a low-dimensional feature encoding via self-supervised tasks and then leveraging preferences over demonstrations to perform fast Bayesian inference. Bayesian REX can learn to play Atari games from demonstrations, without access to the game score, and can generate 100,000 samples from the posterior over reward functions in only 5 minutes on a personal laptop. Bayesian REX also achieves imitation learning performance that is competitive with or better than state-of-the-art methods that learn only point estimates of the reward function. Finally, Bayesian REX enables efficient high-confidence policy evaluation without access to samples of the reward function. These high-confidence performance bounds can be used to rank the performance and risk of a variety of evaluation policies and provide a way to detect reward hacking behaviors.
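
The fast inference step lends itself to a compact illustration. Below is a minimal sketch, assuming a frozen pretrained encoder whose per-trajectory feature sums have already been computed, of random-walk MCMC over linear reward weights under a pairwise (Bradley-Terry-style) preference likelihood. All function names, hyperparameters, the proposal distribution, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def preference_log_likelihood(w, traj_features, prefs, beta=1.0):
    """Bradley-Terry-style log-likelihood of pairwise preferences under a
    linear reward model r(trajectory) = w . features(trajectory).

    traj_features : (num_trajs, d) array of per-trajectory feature sums,
                    produced by a frozen, pretrained encoder (assumed given).
    prefs         : iterable of (i, j) index pairs meaning trajectory j
                    was preferred to trajectory i.
    beta          : inverse-temperature / confidence parameter.
    """
    returns = traj_features @ w                  # predicted return per trajectory
    ll = 0.0
    for i, j in prefs:
        # log P(j preferred to i) = beta*R_j - logsumexp(beta*R_i, beta*R_j)
        ll += beta * returns[j] - np.logaddexp(beta * returns[i], beta * returns[j])
    return ll

def sample_reward_posterior(traj_features, prefs, num_samples=100_000,
                            step_size=0.05, seed=0):
    """Random-walk Metropolis-Hastings over unit-norm reward weight vectors.

    Every step needs only dot products in the low-dimensional feature space,
    which is what makes drawing ~100k posterior samples fast."""
    rng = np.random.default_rng(seed)
    d = traj_features.shape[1]
    w = np.ones(d) / np.sqrt(d)                  # start on the unit sphere
    ll = preference_log_likelihood(w, traj_features, prefs)
    samples = np.empty((num_samples, d))
    for k in range(num_samples):
        proposal = w + step_size * rng.standard_normal(d)
        proposal /= np.linalg.norm(proposal)     # constrain weights to unit norm
        ll_prop = preference_log_likelihood(proposal, traj_features, prefs)
        if np.log(rng.random()) < ll_prop - ll:  # Metropolis accept/reject
            w, ll = proposal, ll_prop
        samples[k] = w
    return samples

if __name__ == "__main__":
    # Toy data: 4 trajectories with 8-dimensional features, ranked 0 < 1 < 2 < 3.
    rng = np.random.default_rng(1)
    features = rng.standard_normal((4, 8)).cumsum(axis=0)
    preferences = [(0, 1), (1, 2), (2, 3), (0, 3)]
    posterior = sample_reward_posterior(features, preferences, num_samples=5_000)
    print("posterior mean weights:", posterior.mean(axis=0))
```

Because each MCMC step touches only low-dimensional dot products, drawing on the order of 100,000 posterior samples is cheap. The resulting samples could then be used, for example, to score a candidate policy's expected feature counts under every sampled reward and report a low quantile as a conservative performance estimate, in the spirit of the high-confidence policy evaluation described above.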
