论文信息 - Simple random search of static linear policies is competitive for reinforcement learning

Simple random search of static linear policies is competitive for reinforcement learning

Model-free reinforcement learning aims to offer off-the-shelf solutions for controlling dynamical systems without requiring models of the system dynamics. We introduce a model-free random search algorithm for training static, linear policies for continuous control problems. Common evaluation methodology shows that our method matches state-of-the-art sample efficiency on the benchmark MuJoCo locomotion tasks. Nonetheless, more rigorous evaluation reveals that the assessment of performance on these benchmarks is optimistic. We evaluate the performance of our method over hundreds of random seeds and many different hyperparameter configurations for each benchmark task. This extensive evaluation is possible because of the small computational footprint of our method. Our simulations highlight a high variability in performance in these benchmark tasks, indicating that commonly used estimations of sample efficiency do not adequately evaluate the performance of RL algorithms. Our results stress the need for new baselines, benchmarks and evaluation methodology for RL algorithms.

[1] Adam Tauman Kalai,et al. Online convex optimization in the bandit setting: gradient descent without a gradient , 2004, SODA '05.

[2] Lin Xiao,et al. Optimal Algorithms for Online Convex Optimization with Multi-Point Bandit Feedback. , 2010, COLT 2010.

[3] P. Cochat,et al. Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[4] Robert D. Nowak,et al. Query Complexity of Derivative-Free Optimization , 2012, NIPS.

[5] Yuval Tassa,et al. MuJoCo: A physics engine for model-based control , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[6] Guy Lever,et al. Deterministic Policy Gradient Algorithms , 2014, ICML.

[7] Sergey Levine,et al. Trust Region Policy Optimization , 2015, ICML.

[8] Yuval Tassa,et al. Continuous control with deep reinforcement learning , 2015, ICLR.

[9] Vianney Perchet,et al. Highly-Smooth Zero-th Order Online Optimization , 2016, COLT.

[10] Pieter Abbeel,et al. Benchmarking Deep Reinforcement Learning for Continuous Control , 2016, ICML.

[11] Alex Graves,et al. Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[12] Demis Hassabis,et al. Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[13] Sergey Levine,et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation , 2015, ICLR.

[14] Yurii Nesterov,et al. Random Gradient-Free Minimization of Convex Functions , 2015, Foundations of Computational Mathematics.

[15] Elman Mansimov,et al. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation , 2017, NIPS.

[16] Xi Chen,et al. Evolution Strategies as a Scalable Alternative to Reinforcement Learning , 2017, ArXiv.

[17] Sergey Levine,et al. Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic , 2016, ICLR.