From Adversarial Imitation Learning to Robust Batch Imitation Learning

Imitation learning (IL) aims to learn a behavior policy by imitating the behavior of an expert. While IL has achieved high performance in various domains, it lacks an established set of evaluation metrics, which makes comparing algorithms and identifying their shortcomings difficult. This thesis proposes a suite of evaluation metrics for imitation learning and uses it to benchmark two baseline IL algorithms, Behavior Cloning (BC) and Generative Adversarial Imitation Learning (GAIL). My results challenge the consensus that GAIL is superior to BC and suggest that any perceived gain is due to a non-standard training methodology employed in prior work. In addition, these evaluations uncover a shortcoming of both algorithms that has not been adequately addressed: both are susceptible to expert data that mixes optimal and degraded trajectories. Because real expert data is inherently noisy, this significantly hampers the usability of IL in the real world. Building on recent insights from batch reinforcement learning as well as self-supervised reward learning, I propose and study a novel batch imitation learning (BIL) algorithm, Disagreement-Regularized Batch-Constrained-Q Imitation Learning (DRBIL), which learns without any interaction with the environment and is robust to expert data degradation. These properties allow DRBIL to learn a good policy without the agent taking risky actions or overfitting to degraded expert trajectories. I instantiate DRBIL in MuJoCo domains and demonstrate state-of-the-art IL performance as well as robustness to data degradation. Overall, this thesis takes an important step toward making IL rigorous and proposes a new BIL framework that is widely adaptable and satisfies critical safety desiderata.
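
To make the description of DRBIL above concrete, below is a minimal sketch of the core idea as I read it: fit an ensemble of behavior-cloning models on the expert batch, turn their disagreement into a self-supervised cost, and run a batch-constrained Q iteration that only bootstraps through actions supported by the batch. Everything here is an illustrative assumption rather than the thesis's implementation: the discrete tabular setting, the bootstrap-resampled count models, and the names ensemble_size, disagreement_coef, and bcq_threshold are all mine.

# A minimal sketch of the DRBIL idea described above, assuming a discrete
# action space and tabular function approximators. All names and constants
# are illustrative assumptions, not the thesis's implementation.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4
gamma, disagreement_coef, bcq_threshold = 0.99, 10.0, 0.3
ensemble_size = 5

# A fixed batch of expert transitions (s, a, s'); no environment access.
batch = [(rng.integers(n_states), rng.integers(n_actions), rng.integers(n_states))
         for _ in range(500)]

# 1) Fit an ensemble of behavior-cloning models on bootstrap resamples.
#    Their disagreement on (s, a) acts as a self-supervised cost: pairs the
#    ensemble agrees on are treated as reliable expert behavior.
counts = np.ones((ensemble_size, n_states, n_actions))  # Laplace smoothing
for k in range(ensemble_size):
    idx = rng.integers(len(batch), size=len(batch))     # bootstrap resample
    for i in idx:
        s, a, _ = batch[i]
        counts[k, s, a] += 1
pi_ensemble = counts / counts.sum(axis=2, keepdims=True)
disagreement = pi_ensemble.var(axis=0)                  # (n_states, n_actions)
reward = -disagreement_coef * disagreement              # low variance -> high reward

# 2) BCQ-style batch Q iteration: bootstrap only through actions the mean
#    behavior model deems sufficiently likely, so the learned policy stays
#    on the support of the expert batch.
pi_mean = pi_ensemble.mean(axis=0)
allowed = pi_mean >= bcq_threshold * pi_mean.max(axis=1, keepdims=True)
Q = np.zeros((n_states, n_actions))
for _ in range(200):
    for s, a, s2 in batch:
        q_next = np.where(allowed[s2], Q[s2], -np.inf).max()
        Q[s, a] += 0.1 * (reward[s, a] + gamma * q_next - Q[s, a])

greedy = np.where(allowed, Q, -np.inf).argmax(axis=1)   # final batch policy
print("greedy action per state:", greedy)

In this sketch, using the negative ensemble variance as the reward is what would yield robustness to degraded trajectories: state-action pairs that appear only in corrupted segments of the batch are fit inconsistently across bootstrap resamples, so the Q iteration steers away from them, while the BCQ-style action mask keeps the policy from drifting off the batch's support entirely.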
