Adversarial Intrinsic Motivation for Reinforcement Learning

Learning with an objective to minimize the mismatch with a reference distribution has been shown to be useful for generative modeling and imitation learning. In this paper, we investigate whether one such objective, the Wasserstein-1 distance between a policy’s state visitation distribution and a target distribution, can be used effectively for reinforcement learning (RL) tasks. Specifically, this paper focuses on goal-conditioned reinforcement learning, where the idealized (unachievable) target distribution places its full measure at the goal. This paper introduces a quasimetric specific to Markov Decision Processes (MDPs) and uses this quasimetric to estimate the Wasserstein-1 distance above. It further shows that the policy that minimizes this Wasserstein-1 distance is the policy that reaches the goal in as few steps as possible. Our approach, termed Adversarial Intrinsic Motivation (AIM), estimates this Wasserstein-1 distance through its dual objective and uses it to compute a supplemental reward function. Our experiments show that this reward function changes smoothly with respect to transitions in the MDP and directs the agent’s exploration toward the goal efficiently. Additionally, we combine AIM with Hindsight Experience Replay (HER) and show that the resulting algorithm accelerates learning significantly on several simulated robotics tasks when compared to other rewards that encourage exploration or accelerate learning.
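The sketch below illustrates how the dual objective and supplemental reward described above could be set up; it is a minimal illustration under stated assumptions, not the authors’ implementation. The network `Potential`, the penalty weight `lam`, the helper names, and the choice of a one-reward-unit-per-transition budget for the quasimetric (Lipschitz-style) constraint are all hypothetical.

```python
# Minimal sketch of an AIM-style dual objective and intrinsic reward (assumptions noted above).
import torch
import torch.nn as nn


class Potential(nn.Module):
    """Scalar potential f(s, g) over (state, goal) pairs."""

    def __init__(self, state_dim: int, goal_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, goal], dim=-1)).squeeze(-1)


def aim_potential_loss(phi, state, next_state, goal, target_state, lam=10.0):
    """Kantorovich-Rubinstein-style dual of W1 between the target (goal) distribution
    and the policy's state visitation distribution. A transition-wise penalty stands in
    for the Lipschitz constraint under an MDP quasimetric: the potential should not
    increase by more than one reward unit along any observed transition."""
    # Dual objective: potential high on target-distribution samples, low on visited states.
    dual = phi(target_state, goal).mean() - phi(state, goal).mean()
    # Penalize potential increases larger than one unit per transition (hypothetical budget).
    overshoot = (phi(next_state, goal) - phi(state, goal) - 1.0).clamp(min=0.0)
    return -(dual - lam * (overshoot ** 2).mean())


def aim_reward(phi, state, next_state, goal):
    """Supplemental (intrinsic) reward: the change in potential along a transition."""
    with torch.no_grad():
        return phi(next_state, goal) - phi(state, goal)
```

In a full agent, samples from the target distribution could simply be the goal states themselves, the potential could be trained adversarially against states drawn from the replay buffer, and the resulting reward could be combined with the task reward in an off-policy learner (e.g., DDPG or TD3) while HER relabels goals in the buffer, along the lines of the AIM + HER combination described in the abstract.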
