Behavior-Guided Reinforcement Learning

We introduce a new approach for comparing reinforcement learning policies, using Wasserstein distances (WDs) in a newly defined latent behavioral space. We show that by utilizing the dual formulation of the WD, we can learn score functions over trajectories that can in turn be used to lead policy optimization towards desired behaviors (or away from undesired ones). Combined with smoothed WDs, the dual formulation allows us to devise efficient algorithms that take stochastic gradient descent steps through WD regularizers. We incorporate these regularizers into two novel on-policy algorithms, Behavior-Guided Policy Gradient and Behavior-Guided Evolution Strategies, which we demonstrate can outperform existing methods in a variety of challenging environments. We also provide an open-source demo.
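To make the smoothed-WD idea concrete, here is a minimal sketch of an entropy-regularized (Sinkhorn) Wasserstein distance between two sets of trajectory embeddings, the kind of quantity that could serve as a behavioral regularizer. This is an illustrative reconstruction, not the paper's implementation: the `sinkhorn_distance` helper, the embedding dimension, the sample sizes, and the cost rescaling are all assumptions introduced here.

```python
import numpy as np

def sinkhorn_distance(X, Y, eps=0.1, n_iters=200):
    """Entropy-smoothed Wasserstein distance between two empirical
    distributions of behavior embeddings (the rows of X and Y)."""
    n, m = X.shape[0], Y.shape[0]
    # Pairwise squared-Euclidean costs between trajectory embeddings.
    C = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    C = C / C.max()                        # rescale costs for numerical stability
    K = np.exp(-C / eps)                   # Gibbs kernel
    a = np.full(n, 1.0 / n)                # uniform weights on samples
    b = np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):               # Sinkhorn fixed-point iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]        # approximate optimal transport plan
    return float(np.sum(P * C))            # smoothed transport cost

# Toy usage: behavioral discrepancy between embeddings from two policies.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(64, 8))    # embeddings under policy 1
Y = rng.normal(0.5, 1.0, size=(64, 8))    # embeddings under policy 2
print(sinkhorn_distance(X, Y))
```

In a regularized objective, a term of this form would be added to (or subtracted from) the policy's return estimate with some coefficient, pulling optimization toward or away from a reference behavior; the dual (potential-based) formulation described in the abstract is what makes stochastic gradient steps through such a term tractable at scale.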
