DERAIL: Diagnostic Environments for Reward And Imitation Learning

The objective of many real-world tasks is complex and difficult to procedurally specify. This makes it necessary to use reward or imitation learning algorithms to infer a reward function or policy directly from human data. Existing benchmarks for these algorithms focus on realism, testing in complex environments. Unfortunately, such benchmarks are slow, unreliable, and cannot isolate failures. As a complementary approach, we develop a suite of simple diagnostic tasks that test individual facets of algorithm performance in isolation. We evaluate a range of common reward and imitation learning algorithms on our tasks. Our results confirm that algorithm performance is highly sensitive to implementation details. Moreover, in a case study of a popular preference-based reward learning implementation, we illustrate how the suite can pinpoint design flaws and rapidly evaluate candidate solutions. The environments are available at https://github.com/HumanCompatibleAI/seals.
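
As a minimal sketch of how the released environments might be used (not taken from the abstract), the snippet below assumes the seals package registers its tasks as standard Gym environments under a "seals/" ID namespace and follows the classic Gym step API; the specific environment ID is illustrative.

```python
# Sketch: load a seals diagnostic environment and run a random policy.
# Assumes `pip install seals gym` and that importing seals registers its Gym IDs.
import gym
import seals  # noqa: F401  # import side effect: registers "seals/..." environments

env = gym.make("seals/CartPole-v0")  # illustrative ID; see the repo for the full list
obs = env.reset()
for _ in range(100):
    action = env.action_space.sample()        # stand-in for a learned policy
    obs, reward, done, info = env.step(action)  # classic (4-tuple) Gym API assumed
    if done:
        obs = env.reset()
env.close()
```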
