Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling

Imitation learning followed by reinforcement learning is a promising paradigm for solving complex control tasks sample-efficiently. However, learning from demonstrations often suffers from covariate shift, which causes cascading errors in the learned policy. We introduce the notion of conservatively-extrapolated value functions, which provably lead to self-correcting policies. We design Value Iteration with Negative Sampling (VINS), an algorithm that learns such conservatively-extrapolated value functions in practice. We show that VINS corrects mistakes of the behavioral cloning policy on simulated robotics benchmark tasks. We also propose using VINS to initialize a reinforcement learning algorithm, which significantly outperforms prior work in sample efficiency.
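
The abstract describes the mechanism only at a high level. Below is a minimal sketch of what value learning with negative sampling could look like, assuming that negative samples are small random perturbations of demonstration states and that their value targets are lowered conservatively in proportion to the perturbation size. All names here (value_net, vins_style_value_loss, sigma, margin) are illustrative assumptions for this sketch, not the paper's actual implementation.

    import torch
    import torch.nn as nn

    obs_dim = 10                      # placeholder observation dimension (assumption)
    value_net = nn.Sequential(        # small MLP value function; architecture is illustrative
        nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, 1))
    optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

    def vins_style_value_loss(states, returns, sigma=0.05, margin=1.0):
        """Fit V on demonstration states, and push the value of perturbed
        (out-of-distribution) states below the in-distribution value, so that
        increasing V points back toward the demonstration distribution."""
        # Standard regression of V toward the demonstration value targets.
        v_demo = value_net(states).squeeze(-1)
        fit_loss = ((v_demo - returns) ** 2).mean()

        # Negative sampling: perturb demonstration states and assign them a
        # conservatively lower target, proportional to the perturbation size.
        noise = sigma * torch.randn_like(states)
        v_neg = value_net(states + noise).squeeze(-1)
        neg_target = v_demo.detach() - margin * noise.norm(dim=-1)
        neg_loss = ((v_neg - neg_target) ** 2).mean()

        return fit_loss + neg_loss

    # One illustrative update on synthetic stand-in demonstration data.
    states = torch.randn(64, obs_dim)
    returns = torch.randn(64)
    loss = vins_style_value_loss(states, returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

At execution time, a value function trained this way could be used to nudge the behavioral cloning action toward next states with higher predicted value, which is one way to realize the self-correction behavior the abstract refers to.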
