Importance Sampling Techniques for Policy Optimization

Abstract: How can we effectively exploit the collected samples when solving a continuous control task with Reinforcement Learning? Recent results have empirically demonstrated that multiple policy optimization steps can be performed with the same batch by using off-distribution techniques based on importance sampling. However, when dealing with off-distribution optimization, it is essential to take into account the uncertainty introduced by the importance sampling process. In this paper, we propose and analyze a class of model-free, policy search algorithms that extend the recent Policy Optimization via Importance Sampling (Metelli et al., 2018) by incorporating two advanced variance reduction techniques: per-decision and multiple importance sampling. For both of them, we derive a high-probability bound, of independent interest, and then we show how to employ it to define a suitable surrogate objective function that can be used for both action-based and parameter-based settings. The resulting algorithms are finally evaluated on a set of continuous control tasks, using both linear and deep policies, and compared with modern policy optimization methods.
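As an illustration of the ideas in the abstract, the following is a minimal sketch, not the authors' implementation, of how the two variance reduction techniques and a risk-averse surrogate objective can be assembled from a batch of trajectories. The function names (per_decision_weights, balance_heuristic_weights, surrogate_objective), the use of the balance heuristic for multiple importance sampling, and the effective-sample-size-based penalty are illustrative assumptions in the spirit of Policy Optimization via Importance Sampling (Metelli et al., 2018); the exact bound and optimization procedure in the paper differ in detail.

```python
import numpy as np
from scipy.special import logsumexp


def per_decision_weights(step_log_ratios):
    """Per-decision importance weights.

    step_log_ratios: (N, T) array with log pi_target(a_t|s_t) - log pi_behavior(a_t|s_t)
    for each of N trajectories and T steps. The weight applied to the reward at
    step t only accumulates the ratios of the first t+1 steps, which reduces
    variance with respect to the full-trajectory weight.
    """
    return np.exp(np.cumsum(step_log_ratios, axis=1))


def balance_heuristic_weights(log_p_target, log_p_behaviors, n_per_behavior):
    """Multiple importance sampling weights with the balance heuristic.

    log_p_target:    (N,) log-density of each sample under the target policy.
    log_p_behaviors: (K, N) log-density of each sample under each of the K
                     behavioral policies that generated the batch.
    n_per_behavior:  (K,) number of samples drawn from each behavioral policy.
    The denominator is the mixture sum_k (n_k / N) q_k(x), which keeps the
    weights bounded whenever at least one behavioral policy covers the sample.
    """
    n_total = np.sum(n_per_behavior)
    log_mixture = logsumexp(
        log_p_behaviors + np.log(n_per_behavior / n_total)[:, None], axis=0
    )
    return np.exp(log_p_target - log_mixture)


def surrogate_objective(weights, returns, delta=0.2):
    """Risk-averse surrogate: importance-weighted mean return minus a penalty
    that grows as the effective sample size shrinks, i.e. as the target policy
    moves away from the behavioral ones (an assumed, simplified penalty)."""
    ess = np.sum(weights) ** 2 / np.sum(weights ** 2)  # effective sample size
    penalty = np.max(np.abs(returns)) * np.sqrt((1.0 - delta) / (delta * ess))
    return np.mean(weights * returns) - penalty
```

In such a sketch, each offline iteration would recompute the weights for the candidate target policy (or hyperpolicy, in the parameter-based setting) and ascend the gradient of surrogate_objective until the penalty term dominates, mirroring the trade-off between estimated return and the uncertainty of the importance sampling estimate discussed above.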

[1] Jing Peng, et al. Incremental multi-step Q-learning, 1994, Machine Learning.

[2] Linyuan Lu, et al. Old and new concentration inequalities, 2006.

[3] J. Burbea. The convexity with respect to Gaussian distributions of divergences of order α, 1984.

[4] Tom Schaul, et al. Conditional Importance Sampling for Off-Policy Learning, 2019, AISTATS.

[5] Alec Radford, et al. Proximal Policy Optimization Algorithms, 2017, ArXiv.

[6] Michael I. Jordan, et al. PEGASUS: A policy search method for large MDPs and POMDPs, 2000, UAI.

[7] Jun Morimoto, et al. Adaptive Step-size Policy Gradients with Average Reward Metric, 2010, ACML.

[8] H. Sebastian Seung, et al. Stochastic policy gradient reinforcement learning on a simple 3D biped, 2004, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[9] Jan Peters, et al. Reinforcement learning in robotics: A survey, 2013, Int. J. Robotics Res.

[10] Tom Schaul, et al. Natural Evolution Strategies, 2008, IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence).

[11] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[12] Isao Ono, et al. Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks, 2010, NIPS.

[13] Sergey Levine, et al. The Mirage of Action-Dependent Baselines in Reinforcement Learning, 2018, ICML.

[14] Jan Peters, et al. Compatible natural gradient policy search, 2019, Machine Learning.

[15] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[16] Marcin Andrychowicz, et al. Solving Rubik's Cube with a Robot Hand, 2019, ArXiv.

[17] Frank Sehnke, et al. Policy Gradients with Parameter-Based Exploration for Control, 2008, ICANN.

[18] Yishay Mansour, et al. Learning Bounds for Importance Weighting, 2010, NIPS.

[19] Yuval Tassa, et al. MuJoCo: A physics engine for model-based control, 2012, IEEE/RSJ International Conference on Intelligent Robots and Systems.

[20] Philip S. Thomas, et al. High Confidence Policy Improvement, 2015, ICML.

[21] Alexander J. Smola, et al. P3O: Policy-on Policy-off Policy Optimization, 2019, UAI.

[22] Shie Mannor, et al. Consistent On-Line Off-Policy Evaluation, 2017, ICML.

[23] Guy Lever, et al. Deterministic Policy Gradient Algorithms, 2014, ICML.

[24] S. Amari, et al. Information geometry of divergence functions, 2010.

[25] Nicolas Le Roux, et al. Understanding the impact of entropy on policy optimization, 2018, ICML.

[26] Marcello Restelli, et al. Balancing Learning Speed and Stability in Policy Gradient via Adaptive Exploration, 2020, AISTATS.

[27] Alexandre M. Bayen, et al. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines, 2018, ICLR.

[28] Marc G. Bellemare, et al. Safe and Efficient Off-Policy Reinforcement Learning, 2016, NIPS.

[29] Nikolaus Hansen, et al. Completely Derandomized Self-Adaptation in Evolution Strategies, 2001, Evolutionary Computation.

[30] J. Schmidhuber, et al. Multi-Dimensional Deep Memory Go-Player for Parameter Exploring Policy Gradients, 2010.

[31] Fady Alajaji, et al. Rényi divergence measures for commonly used univariate continuous distributions, 2013, Inf. Sci.

[32] Marcello Restelli, et al. Stochastic Variance-Reduced Policy Gradient, 2018, ICML.

[33] F. P. Cantelli. Sui confini della probabilità, 1929.

[34] Shun-ichi Amari, et al. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.

[35] Gang Niu, et al. Analysis and Improvement of Policy Gradient Estimation, 2011, NIPS.

[36] Yuval Tassa, et al. Emergence of Locomotion Behaviours in Rich Environments, 2017, ArXiv.

[37] Luca Martino, et al. Effective sample size for importance sampling based on discrepancy measures, 2016, Signal Process.

[38] Shun-ichi Amari, et al. Differential-geometrical methods in statistics, 1985.

[39] Daniele Calandriello, et al. Safe Policy Iteration, 2013, ICML.

[40] Stefan Schaal, et al. Reinforcement learning by reward-weighted regression for operational space control, 2007, ICML.

[41] Jan Peters, et al. A Survey on Policy Search for Robotics, 2013, Found. Trends Robotics.

[42] Stefan Schaal, et al. Reinforcement learning of motor skills with policy gradients, 2008.

[43] A. Winsor. Sampling techniques, 2000, Nursing Times.

[44] Philip S. Thomas, et al. High-Confidence Off-Policy Evaluation, 2015, AAAI.

[45] J. Hoef. Who Invented the Delta Method, 2012.

[46] Pieter Abbeel, et al. Benchmarking Deep Reinforcement Learning for Continuous Control, 2016, ICML.

[47] Shie Mannor, et al. Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs, 2020, AAAI.

[48] Risto Miikkulainen, et al. Evolving Neural Networks through Augmenting Topologies, 2002, Evolutionary Computation.

[49] B. Delyon, et al. Concentration inequalities for sums, 2015.

[50] Ronald J. Williams, et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 2004, Machine Learning.

[51] C. R. Rao, et al. Information and the Accuracy Attainable in the Estimation of Statistical Parameters, 1992.

[52] Alejandro Ribeiro, et al. Hessian Aided Policy Gradient, 2019, ICML.

[53] Yoshua Bengio, et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.

[54] Marc G. Bellemare, et al. Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift, 2019, AAAI.

[55] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[56] Nando de Freitas, et al. Sample Efficient Actor-Critic with Experience Replay, 2016, ICLR.

[57] Sham M. Kakade, et al. Towards Generalization and Simplicity in Continuous Control, 2017, NIPS.

[58] R. Rubinstein. The Cross-Entropy Method for Combinatorial and Continuous Optimization, 1999.

[59] Marcello Restelli, et al. Policy Optimization via Importance Sampling, 2018, NeurIPS.

[60] Sham M. Kakade, et al. A Natural Policy Gradient, 2001, NIPS.

[61] E. Ionides. Truncated Importance Sampling, 2008.

[62] Stefan Schaal, et al. Natural Actor-Critic, 2003, Neurocomputing.

[63] Marcello Restelli, et al. Optimistic Policy Optimization via Multiple Importance Sampling, 2019, ICML.

[64] Yasemin Altun, et al. Relative Entropy Policy Search, 2010.

[65] Yuval Tassa, et al. Continuous control with deep reinforcement learning, 2015, ICLR.

[66] John Langford, et al. Approximately Optimal Approximate Reinforcement Learning, 2002, ICML.

[67] G. Crooks. On Measures of Entropy and Information, 2015.

[68] Quanquan Gu, et al. An Improved Convergence Analysis of Stochastic Variance-Reduced Policy Gradient, 2019, UAI.

[69] Philip S. Thomas, et al. Importance Sampling for Fair Policy Selection, 2017, UAI.

[70] Luis A. Escobar, et al. Statistical Intervals: A Guide for Practitioners, 1991.

[71] Peter Dayan, et al. Q-learning, 1992, Machine Learning.

[72] Emma Brunskill, et al. Off-Policy Policy Gradient with State Distribution Correction, 2019, UAI.

[73] Jun Morimoto, et al. Efficient Sample Reuse in Policy Gradients with Parameter-Based Exploration, 2012, Neural Computation.

[74] András Lörincz, et al. Learning Tetris Using the Noisy Cross-Entropy Method, 2006, Neural Computation.

[75] Frank Sehnke, et al. Parameter-exploring policy gradients, 2010, Neural Networks.

[76] Philip Bachman, et al. Deep Reinforcement Learning that Matters, 2017, AAAI.

[77] Peter L. Bartlett, et al. Infinite-Horizon Policy-Gradient Estimation, 2001, J. Artif. Intell. Res.

[78] Peter Harremoës, et al. Rényi Divergence and Kullback-Leibler Divergence, 2012, IEEE Transactions on Information Theory.

[79] Quanquan Gu, et al. Sample Efficient Policy Gradient Methods with Recursive Variance Reduction, 2020, ICLR.

[80] Martha White, et al. Linear Off-Policy Actor-Critic, 2012, ICML.

[81] Sham M. Kakade, et al. Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes, 2019, COLT.

[82] Sergey Levine, et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation, 2015, ICLR.

[83] Philip S. Thomas, et al. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning, 2016, ICML.

[84] Michael Kearns, et al. Near-Optimal Reinforcement Learning in Polynomial Time, 2002, Machine Learning.

[85] Yao Liu, et al. Understanding the Curse of Horizon in Off-Policy Evaluation via Conditional Importance Sampling, 2020, ICML.

[86] Vicenç Gómez, et al. A unified view of entropy-regularized Markov decision processes, 2017, ArXiv.

[87] B. Delyon, et al. Concentration Inequalities for Sums and Martingales, 2015.

[88] Jakub W. Pachocki, et al. Learning dexterous in-hand manipulation, 2018, Int. J. Robotics Res.

[89] Doina Precup, et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.

[90] John N. Tsitsiklis, et al. Actor-Critic Algorithms, 1999, NIPS.

[91] Sergey Levine, et al. Trust Region Policy Optimization, 2015, ICML.

[92] Qiang Liu, et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation, 2018, NeurIPS.

[93] Shimon Whiteson, et al. Expected Policy Gradients, 2017, AAAI.

[94] Leonidas J. Guibas, et al. Optimally combining sampling techniques for Monte Carlo rendering, 1995, SIGGRAPH.

[95] Philip S. Thomas, et al. Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation, 2017, NIPS.