Novel Policy Seeking with Constrained Optimization

In this work, we address the problem of learning novel policies for reinforcement learning tasks. Instead of following the multi-objective framework used in previous methods, we propose recasting the problem as one of constrained optimization. We first introduce a new metric to evaluate the difference between policies, and then design two practical novel policy seeking methods from this perspective: the Constrained Task Novel Bisector (CTNB) and the Interior Policy Differentiation (IPD), which correspond to the feasible direction method and the interior point method, respectively, both well known in constrained optimization. Experimental comparisons on the MuJoCo control suite show that our methods achieve substantial improvements over previous novelty-seeking methods in terms of both novelty and primal task performance.
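
To make the constrained-optimization view concrete, the following is a minimal sketch of a generic log-barrier (interior point) scheme on a one-dimensional toy problem. It is not the IPD algorithm itself: the scalar "policy" x, the functions task_reward and novelty_margin, the constants X_PREV and R, and all hyperparameters are hypothetical stand-ins chosen only for illustration. The toy task reward is maximized at the previous solution, and the novelty constraint forces the new solution to stay at least a fixed distance away from it.

```python
import math

X_PREV = 0.0  # hypothetical previous solution (also the unconstrained optimum)
R = 1.0       # hypothetical minimum required distance from X_PREV


def task_reward(x):
    # Toy stand-in for the primal task objective; maximized at x = 0.
    return -x * x


def novelty_margin(x):
    # Strictly positive when x is more than R away from X_PREV.
    return (x - X_PREV) ** 2 - R ** 2


def barrier_objective(x, mu):
    # Task reward plus a log barrier that diverges to -inf at the constraint boundary.
    return task_reward(x) + mu * math.log(novelty_margin(x))


def barrier_ascent(x0, mu=1.0, shrink=0.5, outer=20, inner=200, lr=0.05):
    """Maximize task_reward subject to novelty_margin(x) > 0 via a log barrier."""
    x = x0
    assert novelty_margin(x) > 0, "start must be strictly feasible"
    for _ in range(outer):
        for _ in range(inner):
            # Derivative of barrier_objective with respect to x.
            grad = -2.0 * x + mu * 2.0 * (x - X_PREV) / novelty_margin(x)
            step = lr * grad
            # Backtracking: accept only feasible steps with sufficient increase.
            for _ in range(60):
                cand = x + step
                if (novelty_margin(cand) > 0 and
                        barrier_objective(cand, mu)
                        >= barrier_objective(x, mu) + 0.3 * grad * step):
                    x = cand
                    break
                step *= 0.5
        mu *= shrink  # gradually weaken the barrier
    return x


if __name__ == "__main__":
    x_star = barrier_ascent(2.0)
    print(x_star, novelty_margin(x_star))
```

Starting from a feasible point, the iterate never leaves the feasible region and, as the barrier weight shrinks, approaches the constraint boundary from the inside. This contrasts with a weighted-sum (multi-objective) formulation, where the trade-off between reward and novelty must be tuned through a fixed coefficient.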
