Novel Policy Seeking with Constrained Optimization

In this work, we address the problem of learning to seek novel policies in reinforcement learning tasks. Instead of following the multi-objective framework used in previous methods, we propose to rethink the problem from the perspective of constrained optimization. We first introduce a new metric to evaluate the difference between policies, and then design two practical novel-policy-seeking methods under this perspective, the Constrained Task Novel Bisector (CTNB) and the Interior Policy Differentiation (IPD), which correspond to the feasible direction method and the interior point method commonly used in constrained optimization. Experimental comparisons on the MuJoCo control suite show that our methods achieve substantial improvements over previous novelty-seeking methods in terms of both novelty and primal task performance.
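
The constrained-optimization view can be summarized as maximizing the expected task return subject to a lower bound on the policy-difference metric with respect to previously found policies. The sketch below is a minimal illustration of a feasible-direction update in the spirit of CTNB, not the authors' implementation; the function name ctnb_direction, the gradient arguments g_task and g_novelty, and the threshold behind constraint_satisfied are hypothetical placeholders.

```python
# A minimal sketch of the constrained view: maximize task return J(pi)
# subject to D(pi, pi_prev) >= d0 for every previously found policy pi_prev,
# where D is a policy-difference metric. All names here are illustrative.
import numpy as np

def ctnb_direction(g_task, g_novelty, constraint_satisfied):
    """Combine task and novelty gradients into a feasible ascent direction.

    g_task               : gradient of the task objective w.r.t. policy parameters
    g_novelty            : gradient of the novelty (policy-difference) constraint
    constraint_satisfied : True if D(pi, pi_prev) >= d0 already holds
    """
    if constraint_satisfied:
        # Strictly inside the feasible region: follow the task gradient alone.
        return g_task
    # On or outside the constraint boundary: step along the bisector of the
    # normalized gradients, improving the task objective while moving back
    # toward the feasible (sufficiently novel) region.
    u = g_task / (np.linalg.norm(g_task) + 1e-8)
    v = g_novelty / (np.linalg.norm(g_novelty) + 1e-8)
    bisector = u + v
    return bisector * np.linalg.norm(g_task) / (np.linalg.norm(bisector) + 1e-8)

if __name__ == "__main__":
    # Toy check: with the constraint violated, the step points between the two gradients.
    print(ctnb_direction(np.array([1.0, 0.0]), np.array([0.0, 1.0]), False))
```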
