Learning to Score Behaviors for Guided Policy Optimization

We introduce a new approach for comparing reinforcement learning policies based on Wasserstein distances (WDs) in a newly defined latent behavioral space. We show that by exploiting the dual formulation of the WD, we can learn score functions over policy behaviors that can in turn be used to steer policy optimization towards desired behaviors or away from undesired ones. Combined with smoothed WDs, the dual formulation allows us to devise efficient algorithms that take stochastic gradient descent steps through WD regularizers. We incorporate these regularizers into two novel on-policy algorithms, Behavior-Guided Policy Gradient and Behavior-Guided Evolution Strategies, which we demonstrate outperform existing methods in a variety of challenging environments. We also provide an open-source demo.
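The smoothed WD the abstract refers to can be estimated from samples via Sinkhorn iterations on the dual potentials. The sketch below is a minimal illustration of that idea, not the paper's implementation: the behavior embeddings, the squared-Euclidean cost, the uniform weights, and the regularization strength `epsilon` are all assumptions made for the example.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_distance(x, y, epsilon=0.1, n_iters=200):
    """Entropy-smoothed Wasserstein distance between two point clouds
    (e.g. latent behavior embeddings of two policies), uniform weights.

    Uses log-domain Sinkhorn updates on the dual potentials (f, g)
    for numerical stability."""
    # Squared-Euclidean cost between embedding points: C[i, j] = |x_i - y_j|^2.
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    n, m = C.shape
    log_a = -np.log(n) * np.ones(n)  # uniform source weights (log)
    log_b = -np.log(m) * np.ones(m)  # uniform target weights (log)
    f = np.zeros(n)                  # dual potential on x
    g = np.zeros(m)                  # dual potential on y
    for _ in range(n_iters):
        # Block-coordinate ascent on the entropy-smoothed dual objective.
        f = -epsilon * logsumexp((g[None, :] - C) / epsilon + log_b[None, :], axis=1)
        g = -epsilon * logsumexp((f[:, None] - C) / epsilon + log_a[:, None], axis=0)
    # Transport plan induced by the converged potentials.
    P = np.exp((f[:, None] + g[None, :] - C) / epsilon + log_a[:, None] + log_b[None, :])
    return np.sum(P * C)
```

Because the quantity is defined through dual potentials that are themselves learned, a gradient of this objective with respect to the embeddings can be taken with any autodiff framework, which is what makes it usable as a behavior regularizer inside stochastic gradient descent.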
