Importance Sampling Techniques for Policy Optimization

Abstract: How can we effectively exploit the collected samples when solving a continuous control task with Reinforcement Learning? Recent results have empirically demonstrated that multiple policy optimization steps can be performed with the same batch by using off-distribution techniques based on importance sampling. However, when dealing with off-distribution optimization, it is essential to take into account the uncertainty introduced by the importance sampling process. In this paper, we propose and analyze a class of model-free, policy search algorithms that extend the recent Policy Optimization via Importance Sampling (Metelli et al., 2018) by incorporating two advanced variance reduction techniques: per-decision and multiple importance sampling. For both of them, we derive a high-probability bound, of independent interest, and then we show how to employ it to define a suitable surrogate objective function that can be used for both action-based and parameter-based settings. The resulting algorithms are finally evaluated on a set of continuous control tasks, using both linear and deep policies, and compared with modern policy optimization methods.
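As an illustration of the ideas in the abstract, the following is a minimal sketch, not the authors' implementation, of how the two variance reduction techniques and a risk-averse surrogate objective can be assembled from a batch of trajectories. The function names (per_decision_weights, balance_heuristic_weights, surrogate_objective), the use of the balance heuristic for multiple importance sampling, and the effective-sample-size-based penalty are illustrative assumptions in the spirit of Policy Optimization via Importance Sampling (Metelli et al., 2018); the exact bound and optimization procedure in the paper differ in detail.

```python
import numpy as np
from scipy.special import logsumexp


def per_decision_weights(step_log_ratios):
    """Per-decision importance weights.

    step_log_ratios: (N, T) array with log pi_target(a_t|s_t) - log pi_behavior(a_t|s_t)
    for each of N trajectories and T steps. The weight applied to the reward at
    step t only accumulates the ratios of the first t+1 steps, which reduces
    variance with respect to the full-trajectory weight.
    """
    return np.exp(np.cumsum(step_log_ratios, axis=1))


def balance_heuristic_weights(log_p_target, log_p_behaviors, n_per_behavior):
    """Multiple importance sampling weights with the balance heuristic.

    log_p_target:    (N,) log-density of each sample under the target policy.
    log_p_behaviors: (K, N) log-density of each sample under each of the K
                     behavioral policies that generated the batch.
    n_per_behavior:  (K,) number of samples drawn from each behavioral policy.
    The denominator is the mixture sum_k (n_k / N) q_k(x), which keeps the
    weights bounded whenever at least one behavioral policy covers the sample.
    """
    n_total = np.sum(n_per_behavior)
    log_mixture = logsumexp(
        log_p_behaviors + np.log(n_per_behavior / n_total)[:, None], axis=0
    )
    return np.exp(log_p_target - log_mixture)


def surrogate_objective(weights, returns, delta=0.2):
    """Risk-averse surrogate: importance-weighted mean return minus a penalty
    that grows as the effective sample size shrinks, i.e. as the target policy
    moves away from the behavioral ones (an assumed, simplified penalty)."""
    ess = np.sum(weights) ** 2 / np.sum(weights ** 2)  # effective sample size
    penalty = np.max(np.abs(returns)) * np.sqrt((1.0 - delta) / (delta * ess))
    return np.mean(weights * returns) - penalty
```

In such a sketch, each offline iteration would recompute the weights for the candidate target policy (or hyperpolicy, in the parameter-based setting) and ascend the gradient of surrogate_objective until the penalty term dominates, mirroring the trade-off between estimated return and the uncertainty of the importance sampling estimate discussed above.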

[1] Jing Peng, et al. Incremental multi-step Q-learning, 1994, Machine Learning.

[2] Linyuan Lu, et al. Old and new concentration inequalities, 2006.

[3] J. Burbea. The convexity with respect to Gaussian distributions of divergences of order α, 1984.

[4] Tom Schaul, et al. Conditional Importance Sampling for Off-Policy Learning, 2019, AISTATS.

[5] Alec Radford, et al. Proximal Policy Optimization Algorithms, 2017, ArXiv.

[6] Michael I. Jordan, et al. PEGASUS: A policy search method for large MDPs and POMDPs, 2000, UAI.

[7] Jun Morimoto, et al. Adaptive Step-size Policy Gradients with Average Reward Metric, 2010, ACML.

[8] H. Sebastian Seung, et al. Stochastic policy gradient reinforcement learning on a simple 3D biped, 2004, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[9] Jan Peters, et al. Reinforcement learning in robotics: A survey, 2013, Int. J. Robotics Res.

[10] Tom Schaul, et al. Natural Evolution Strategies, 2008, IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence).

[11] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[12] Isao Ono, et al. Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks, 2010, NIPS.

[13] Sergey Levine, et al. The Mirage of Action-Dependent Baselines in Reinforcement Learning, 2018, ICML.

[14] Jan Peters, et al. Compatible natural gradient policy search, 2019, Machine Learning.

[15] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[16] Marcin Andrychowicz, et al. Solving Rubik's Cube with a Robot Hand, 2019, ArXiv.

[17] Frank Sehnke, et al. Policy Gradients with Parameter-Based Exploration for Control, 2008, ICANN.

[18] Yishay Mansour, et al. Learning Bounds for Importance Weighting, 2010, NIPS.

[19] Yuval Tassa, et al. MuJoCo: A physics engine for model-based control, 2012, IEEE/RSJ International Conference on Intelligent Robots and Systems.

[20] Philip S. Thomas, et al. High Confidence Policy Improvement, 2015, ICML.

[21] Alexander J. Smola, et al. P3O: Policy-on Policy-off Policy Optimization, 2019, UAI.

[22] Shie Mannor, et al. Consistent On-Line Off-Policy Evaluation, 2017, ICML.

[23] Guy Lever, et al. Deterministic Policy Gradient Algorithms, 2014, ICML.

[24] S. Amari, et al. Information geometry of divergence functions, 2010.

[25] Nicolas Le Roux, et al. Understanding the impact of entropy on policy optimization, 2018, ICML.

[26] Marcello Restelli, et al. Balancing Learning Speed and Stability in Policy Gradient via Adaptive Exploration, 2020, AISTATS.

[27] Alexandre M. Bayen, et al. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines, 2018, ICLR.

[28] Marc G. Bellemare, et al. Safe and Efficient Off-Policy Reinforcement Learning, 2016, NIPS.

[29] Nikolaus Hansen, et al. Completely Derandomized Self-Adaptation in Evolution Strategies, 2001, Evolutionary Computation.

[30] J. Schmidhuber, et al. Multi-Dimensional Deep Memory Go-Player for Parameter Exploring Policy Gradients, 2010.

[31] Fady Alajaji, et al. Rényi divergence measures for commonly used univariate continuous distributions, 2013, Inf. Sci.

[32] Marcello Restelli, et al. Stochastic Variance-Reduced Policy Gradient, 2018, ICML.

[33] F. P. Cantelli. Sui confini della probabilità, 1929.

[34] Shun-ichi Amari, et al. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.

[35] Gang Niu, et al. Analysis and Improvement of Policy Gradient Estimation, 2011, NIPS.

[36] Yuval Tassa, et al. Emergence of Locomotion Behaviours in Rich Environments, 2017, ArXiv.

[37] Luca Martino, et al. Effective sample size for importance sampling based on discrepancy measures, 2016, Signal Process.

[38] Shun-ichi Amari, et al. Differential-geometrical methods in statistics, 1985.

[39] Daniele Calandriello, et al. Safe Policy Iteration, 2013, ICML.

[40] Stefan Schaal, et al. Reinforcement learning by reward-weighted regression for operational space control, 2007, ICML.

[41] Jan Peters, et al. A Survey on Policy Search for Robotics, 2013, Found. Trends Robotics.

[42] Stefan Schaal, et al. Reinforcement learning of motor skills with policy gradients, 2008.

[43] A. Winsor. Sampling techniques, 2000, Nursing Times.

[44] Philip S. Thomas, et al. High-Confidence Off-Policy Evaluation, 2015, AAAI.

[45] J. Hoef. Who Invented the Delta Method, 2012.

[46] Pieter Abbeel, et al. Benchmarking Deep Reinforcement Learning for Continuous Control, 2016, ICML.

[47] Shie Mannor, et al. Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs, 2020, AAAI.

[48] Risto Miikkulainen, et al. Evolving Neural Networks through Augmenting Topologies, 2002, Evolutionary Computation.

[49] B. Delyon, et al. Concentration inequalities for sums, 2015.

[50] Ronald J. Williams, et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 2004, Machine Learning.

[51] C. R. Rao, et al. Information and the Accuracy Attainable in the Estimation of Statistical Parameters, 1992.

[52] Alejandro Ribeiro, et al. Hessian Aided Policy Gradient, 2019, ICML.

[53] Yoshua Bengio, et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.

[54] Marc G. Bellemare, et al. Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift, 2019, AAAI.

[55] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[56] Nando de Freitas, et al. Sample Efficient Actor-Critic with Experience Replay, 2016, ICLR.

[57] Sham M. Kakade, et al. Towards Generalization and Simplicity in Continuous Control, 2017, NIPS.

[58] R. Rubinstein. The Cross-Entropy Method for Combinatorial and Continuous Optimization, 1999.

[59] Marcello Restelli, et al. Policy Optimization via Importance Sampling, 2018, NeurIPS.

[60] Sham M. Kakade, et al. A Natural Policy Gradient, 2001, NIPS.

[61] E. Ionides. Truncated Importance Sampling, 2008.

[62] Stefan Schaal, et al. Natural Actor-Critic, 2003, Neurocomputing.

[63] Marcello Restelli, et al. Optimistic Policy Optimization via Multiple Importance Sampling, 2019, ICML.

[64] Yasemin Altun, et al. Relative Entropy Policy Search, 2010.

[65] Yuval Tassa, et al. Continuous control with deep reinforcement learning, 2015, ICLR.

[66] John Langford, et al. Approximately Optimal Approximate Reinforcement Learning, 2002, ICML.

[67] G. Crooks. On Measures of Entropy and Information, 2015.

[68] Quanquan Gu, et al. An Improved Convergence Analysis of Stochastic Variance-Reduced Policy Gradient, 2019, UAI.

[69] Philip S. Thomas, et al. Importance Sampling for Fair Policy Selection, 2017, UAI.

[70] Luis A. Escobar, et al. Statistical Intervals: A Guide for Practitioners, 1991.

[71] Peter Dayan, et al. Q-learning, 1992, Machine Learning.

[72] Emma Brunskill, et al. Off-Policy Policy Gradient with State Distribution Correction, 2019, UAI.

[73] Jun Morimoto, et al. Efficient Sample Reuse in Policy Gradients with Parameter-Based Exploration, 2012, Neural Computation.

[74] András Lörincz, et al. Learning Tetris Using the Noisy Cross-Entropy Method, 2006, Neural Computation.

[75] Frank Sehnke, et al. Parameter-exploring policy gradients, 2010, Neural Networks.

[76] Philip Bachman, et al. Deep Reinforcement Learning that Matters, 2017, AAAI.

[77] Peter L. Bartlett, et al. Infinite-Horizon Policy-Gradient Estimation, 2001, J. Artif. Intell. Res.

[78] Peter Harremoës, et al. Rényi Divergence and Kullback-Leibler Divergence, 2012, IEEE Transactions on Information Theory.

[79] Quanquan Gu, et al. Sample Efficient Policy Gradient Methods with Recursive Variance Reduction, 2020, ICLR.

[80] Martha White, et al. Linear Off-Policy Actor-Critic, 2012, ICML.

[81] Sham M. Kakade, et al. Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes, 2019, COLT.

[82] Sergey Levine, et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation, 2015, ICLR.

[83] Philip S. Thomas, et al. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning, 2016, ICML.

[84] Michael Kearns, et al. Near-Optimal Reinforcement Learning in Polynomial Time, 2002, Machine Learning.

[85] Yao Liu, et al. Understanding the Curse of Horizon in Off-Policy Evaluation via Conditional Importance Sampling, 2020, ICML.

[86] Vicenç Gómez, et al. A unified view of entropy-regularized Markov decision processes, 2017, ArXiv.

[87] B. Delyon, et al. Concentration Inequalities for Sums and Martingales, 2015.

[88] Jakub W. Pachocki, et al. Learning dexterous in-hand manipulation, 2018, Int. J. Robotics Res.

[89] Doina Precup, et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.

[90] John N. Tsitsiklis, et al. Actor-Critic Algorithms, 1999, NIPS.

[91] Sergey Levine, et al. Trust Region Policy Optimization, 2015, ICML.

[92] Qiang Liu, et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation, 2018, NeurIPS.

[93] Shimon Whiteson, et al. Expected Policy Gradients, 2017, AAAI.

[94] Leonidas J. Guibas, et al. Optimally combining sampling techniques for Monte Carlo rendering, 1995, SIGGRAPH.

[95] Philip S. Thomas, et al. Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation, 2017, NIPS.