BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning

There has recently been a surge of research in batch Deep Reinforcement Learning (DRL), which aims to learn a high-performing policy from a given dataset without additional interactions with the environment. We propose a new algorithm, Best-Action Imitation Learning (BAIL), which strives for both simplicity and performance. BAIL learns a V function, uses the V function to select actions it believes to be high-performing, and then uses those actions to train a policy network with imitation learning. For the MuJoCo benchmark, we provide a comprehensive experimental study of BAIL, comparing its performance to that of four other batch Q-learning and imitation-learning schemes across a large variety of batch datasets. Our experiments show that BAIL achieves substantially higher performance than the other schemes while also being computationally much faster than the batch Q-learning schemes.
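To make the three-step pipeline concrete, here is a minimal PyTorch sketch. It is not the paper's implementation: it assumes Monte Carlo returns have been precomputed for each state, fits the V network by plain regression on those returns, and substitutes a simple keep-the-top-fraction rule for BAIL's actual action-selection criterion, which the abstract does not specify. The function bail_sketch, the network sizes, and all hyperparameters are illustrative assumptions.

```python
# A hedged sketch of the BAIL pipeline described above. Assumptions:
#  - states [N, s_dim], actions [N, a_dim], returns [N] are Monte Carlo
#    returns already computed from the batch dataset;
#  - a "keep the top select_frac fraction" rule stands in for BAIL's
#    actual selection criterion;
#  - all sizes and hyperparameters are illustrative, not the paper's.
import numpy as np
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

def bail_sketch(states, actions, returns, select_frac=0.25, epochs=50):
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.float32)
    returns = torch.as_tensor(returns, dtype=torch.float32).unsqueeze(1)

    # Step 1: fit a V network to the observed returns by regression.
    v_net = mlp(states.shape[1], 1)
    v_opt = torch.optim.Adam(v_net.parameters(), lr=1e-3)
    for _ in range(epochs):
        v_opt.zero_grad()
        nn.functional.mse_loss(v_net(states), returns).backward()
        v_opt.step()

    # Step 2: select the actions judged best -- here, those whose observed
    # return most exceeds the fitted value estimate of their state.
    with torch.no_grad():
        advantage = (returns - v_net(states)).squeeze(1)
    k = max(1, int(select_frac * len(states)))
    idx = torch.topk(advantage, k).indices

    # Step 3: behavioral cloning on the selected (state, action) pairs only.
    policy = mlp(states.shape[1], actions.shape[1])
    p_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        p_opt.zero_grad()
        nn.functional.mse_loss(policy(states[idx]), actions[idx]).backward()
        p_opt.step()
    return policy

if __name__ == "__main__":
    # Hypothetical usage on random placeholder data (1000 transitions,
    # 17-dim states, 6-dim actions, as in some MuJoCo tasks).
    policy = bail_sketch(np.random.randn(1000, 17),
                         np.random.randn(1000, 6),
                         np.random.randn(1000))
```

Note the design choice this sketch reflects: because the policy is trained by supervised imitation on actions already present in the batch, there is no bootstrapped Q-target to diverge on out-of-distribution actions, which is consistent with the abstract's claim that BAIL is both simpler and computationally faster than the batch Q-learning schemes it is compared against.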
