Primal Wasserstein Imitation Learning

Imitation Learning (IL) methods seek to match the behavior of an agent with that of an expert. In this work, we propose a new IL method based on a conceptually simple algorithm: Primal Wasserstein Imitation Learning (PWIL), which ties to the primal form of the Wasserstein distance between the expert's and the agent's state-action distributions. We present a reward function that is derived offline and requires little fine-tuning, in contrast to recent adversarial IL algorithms that learn a reward function through interactions with the environment. We show that we can recover expert behavior on a variety of continuous control tasks from the MuJoCo domain in a sample-efficient manner, in terms of both agent interactions and expert interactions with the environment. Finally, we show that the behavior of the trained agent matches that of the expert under the Wasserstein distance, rather than under the commonly used proxy of performance.
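To make the core idea concrete, the sketch below illustrates one way such an offline reward could be computed: a greedy approximation of the primal Wasserstein coupling between an agent episode and a pooled set of expert state-action pairs, where each step's transport cost is turned into a bonus. This is a minimal illustration under stated assumptions, not the paper's reference implementation: the function name pwil_rewards, the plain Euclidean cost, and the scales alpha and beta are placeholders chosen for readability.

```python
import numpy as np

def pwil_rewards(agent_sa, expert_sa, alpha=5.0, beta=5.0):
    """Greedy approximation of the primal Wasserstein coupling between an
    agent episode and pooled expert state-action pairs, converted into a
    per-step reward (illustrative sketch; alpha and beta are placeholder scales).

    agent_sa:  array of shape (T, d), agent state-action vectors, one per step.
    expert_sa: array of shape (D, d), pooled expert state-action vectors.
    """
    T, D = len(agent_sa), len(expert_sa)
    expert_weights = np.full(D, 1.0 / D)   # mass remaining on each expert point
    rewards = np.zeros(T)

    for t, x in enumerate(agent_sa):
        mass_to_place = 1.0 / T            # mass carried by this agent step
        cost = 0.0
        dists = np.linalg.norm(expert_sa - x, axis=1)
        # Greedily ship this step's mass to the closest expert points
        # that still have mass available.
        while mass_to_place > 1e-12:
            masked = np.where(expert_weights > 1e-12, dists, np.inf)
            j = int(np.argmin(masked))
            if not np.isfinite(masked[j]):
                break                      # expert mass exhausted (numerical residue)
            moved = min(mass_to_place, expert_weights[j])
            cost += moved * dists[j]
            expert_weights[j] -= moved
            mass_to_place -= moved
        # Larger transport cost maps to a smaller reward.
        rewards[t] = alpha * np.exp(-beta * T * cost)
    return rewards
```

In practice the cost would be a distance over standardized, concatenated state-action vectors rather than the raw Euclidean distance used here, and the reward would be computed online as the episode unfolds; the sketch only shows the greedy-coupling mechanism that avoids learning a reward through environment interactions.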
