Paper Information - Learnings Options End-to-End for Continuous Action Tasks

Learnings Options End-to-End for Continuous Action Tasks

We present new results on learning temporally extended actions for continuous tasks, using the options framework (Sutton et al. [1999b], Precup [2000]). In order to achieve this goal we work with the option-critic architecture (Bacon et al. [2017]) using a deliberation cost and train it with proximal policy optimization (Schulman et al. [2017]) instead of vanilla policy gradient. Results on MuJoCo domains are promising, but lead to interesting questions about when a given option should be used, an issue directly connected to the use of initiation sets.
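As a rough illustration of two ingredients named in the abstract, the sketch below shows the per-sample clipped surrogate from proximal policy optimization and a termination advantage with a deliberation-cost margin. It is not the authors' implementation: the function names, the scalar per-sample form, and the default values (`clip_eps=0.2`, `deliberation_cost=0.0`) are assumptions made for this example only.

```python
def ppo_clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate (to be maximized).

    ratio     : pi_new(a|s) / pi_old(a|s) for the sampled action
    advantage : advantage estimate for that action
    clip_eps  : clipping range (assumed default, not taken from the paper)
    """
    # Clamp the probability ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


def termination_advantage(q_option, v_state, deliberation_cost=0.0):
    """Advantage of continuing the current option, plus a margin.

    q_option          : value of staying with the current option in this state
    v_state           : value of the state under the policy over options
    deliberation_cost : margin (eta) that makes switching options less attractive
    """
    # A positive result argues for lowering the termination probability.
    return q_option - v_state + deliberation_cost
```

With a nonzero deliberation cost, an option whose value is slightly below the state value can still be kept, which is one way longer-lasting options are encouraged.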

[1] Doina Precup, et al. Temporal abstraction in reinforcement learning, 2000, ICML 2000.

[2] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[3] Yuval Tassa, et al. MuJoCo: A physics engine for model-based control, 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[4] Sergey Levine, et al. Trust Region Policy Optimization, 2015, ICML.

[5] Herbert A. Simon, et al. The Sciences of the Artificial, 1970.

[6] Gregory Dudek, et al. Benchmark Environments for Multitask Learning in Continuous Domains, 2017, ArXiv.

[7] R. J. Williams, et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 2004, Machine Learning.

[8] Doina Precup, et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, 1999, Artif. Intell.

[9] Doina Precup, et al. When Waiting is not an Option: Learning Options with a Deliberation Cost, 2017, AAAI.

[10] Sergey Levine, et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation, 2015, ICLR.

[11] Alec Radford, et al. Proximal Policy Optimization Algorithms, 2017, ArXiv.

[12] Doina Precup, et al. The Option-Critic Architecture, 2016, AAAI.