Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling

Imitation learning followed by reinforcement learning is a promising paradigm for solving complex control tasks sample-efficiently. However, learning from demonstrations often suffers from covariate shift, which causes cascading errors in the learned policy. We introduce the notion of conservatively-extrapolated value functions, which provably lead to self-correcting policies. We design Value Iteration with Negative Sampling (VINS), an algorithm that learns such conservatively-extrapolated value functions in practice. We show that VINS corrects mistakes of the behavioral cloning policy on simulated robotics benchmark tasks. We also propose using VINS to initialize a reinforcement learning algorithm, which significantly outperforms prior work in sample efficiency.
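
The abstract describes the mechanism only at a high level. Below is a minimal sketch of what value learning with negative sampling could look like, assuming that negative samples are small random perturbations of demonstration states and that their value targets are lowered conservatively in proportion to the perturbation size. All names here (value_net, vins_style_value_loss, sigma, margin) are illustrative assumptions for this sketch, not the paper's actual implementation.

    import torch
    import torch.nn as nn

    obs_dim = 10                      # placeholder observation dimension (assumption)
    value_net = nn.Sequential(        # small MLP value function; architecture is illustrative
        nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, 1))
    optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

    def vins_style_value_loss(states, returns, sigma=0.05, margin=1.0):
        """Fit V on demonstration states, and push the value of perturbed
        (out-of-distribution) states below the in-distribution value, so that
        increasing V points back toward the demonstration distribution."""
        # Standard regression of V toward the demonstration value targets.
        v_demo = value_net(states).squeeze(-1)
        fit_loss = ((v_demo - returns) ** 2).mean()

        # Negative sampling: perturb demonstration states and assign them a
        # conservatively lower target, proportional to the perturbation size.
        noise = sigma * torch.randn_like(states)
        v_neg = value_net(states + noise).squeeze(-1)
        neg_target = v_demo.detach() - margin * noise.norm(dim=-1)
        neg_loss = ((v_neg - neg_target) ** 2).mean()

        return fit_loss + neg_loss

    # One illustrative update on synthetic stand-in demonstration data.
    states = torch.randn(64, obs_dim)
    returns = torch.randn(64)
    loss = vins_style_value_loss(states, returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

At execution time, a value function trained this way could be used to nudge the behavioral cloning action toward next states with higher predicted value, which is one way to realize the self-correction behavior the abstract refers to.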
