Novel Policy Seeking with Constrained Optimization

In this work, we address the problem of learning to seek novel policies in reinforcement learning tasks. Instead of following the multi-objective framework used in previous methods, we propose to rethink the problem from the perspective of constrained optimization. We first introduce a new metric to evaluate the difference between policies, and then design two practical novel policy seeking methods under this perspective: the Constrained Task Novel Bisector (CTNB) and the Interior Policy Differentiation (IPD), which correspond to the feasible direction method and the interior point method commonly used in constrained optimization. Experimental comparisons on the MuJoCo control suite show that our methods achieve substantial improvements over previous novelty-seeking methods in terms of both novelty and primal task performance.
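As a rough illustration of the constrained-optimization view (the exact metric and constraint form are defined in the paper; the notation below, including the threshold $r_0$, is ours and only indicative), suppose policies $\pi_1, \dots, \pi_{k-1}$ have already been found and $D(\cdot,\cdot)$ measures the difference between two policies. The $k$-th novel policy can then be sought by solving
\[
\max_{\pi_k} \; \mathbb{E}_{\tau \sim \pi_k}\big[R(\tau)\big]
\quad \text{s.t.} \quad
D(\pi_k, \pi_i) \ge r_0, \qquad i = 1, \dots, k-1,
\]
i.e., maximize the primal task return subject to the new policy being sufficiently different from all previously obtained ones. Under this reading, a feasible-direction-style method (as CTNB is described) updates the policy only along ascent directions that do not violate the novelty constraints, while an interior-point-style method (as IPD is described) keeps the optimization inside the feasible, sufficiently-novel region throughout training.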
