Meta-Inverse Reinforcement Learning with Probabilistic Context Variables

Reinforcement learning demands a reward function, which is often difficult to provide or design in real-world applications. While inverse reinforcement learning (IRL) holds promise for automatically learning reward functions from demonstrations, several major challenges remain. First, existing IRL methods learn reward functions from scratch, requiring large numbers of demonstrations to correctly infer the reward for each task the agent may need to perform. Second, and more subtly, existing methods typically assume demonstrations of a single, isolated behavior or task, while in practice it is significantly more natural and scalable to provide datasets of heterogeneous behaviors. To this end, we propose a deep latent variable model that learns rewards from unstructured, multi-task demonstration data and, critically, uses this experience to infer robust rewards for new, structurally similar tasks from a single demonstration. Our experiments on multiple continuous control tasks demonstrate the effectiveness of our approach compared to state-of-the-art imitation and inverse reinforcement learning methods.
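The abstract describes inferring a probabilistic task context from a single demonstration and conditioning the learned reward on that context. The following is a minimal, hypothetical PyTorch sketch of that structure only; the network sizes, the mean-pooled Gaussian posterior, and all names are illustrative assumptions, not the paper's implementation, and the adversarial (AIRL-style) training loop that would fit the reward is omitted.

```python
# Hypothetical sketch of a latent-context IRL setup (illustrative assumptions,
# not the paper's released code): an inference network encodes a single
# demonstration into a probabilistic context variable m, and a reward network
# conditioned on m scores state-action pairs for that inferred task.
import torch
import torch.nn as nn


class ContextEncoder(nn.Module):
    """Maps a demonstration of shape (T, state_dim + action_dim) to q(m | demo)."""

    def __init__(self, obs_act_dim, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # mean and log-variance
        )
        self.latent_dim = latent_dim

    def forward(self, demo):
        # Mean-pool per-timestep features so the posterior does not depend
        # on demonstration length or ordering (an assumption of this sketch).
        stats = self.net(demo).mean(dim=0)
        mean, log_var = stats[: self.latent_dim], stats[self.latent_dim:]
        std = torch.exp(0.5 * log_var)
        m = mean + std * torch.randn_like(std)  # reparameterized sample of m
        return m, mean, log_var


class ContextConditionedReward(nn.Module):
    """Reward r(s, a, m): scores state-action pairs under a task context m."""

    def __init__(self, obs_act_dim, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs_act, m):
        # Broadcast the single context vector across the batch of pairs.
        m_tiled = m.expand(obs_act.shape[0], -1)
        return self.net(torch.cat([obs_act, m_tiled], dim=-1)).squeeze(-1)


if __name__ == "__main__":
    # Toy usage: infer a context from one demonstration, then score a batch
    # of state-action pairs under that inferred task.
    obs_act_dim, latent_dim = 10, 4
    encoder = ContextEncoder(obs_act_dim, latent_dim)
    reward_fn = ContextConditionedReward(obs_act_dim, latent_dim)

    demo = torch.randn(50, obs_act_dim)      # one expert trajectory
    m, mean, log_var = encoder(demo)
    rewards = reward_fn(torch.randn(32, obs_act_dim), m.unsqueeze(0))
    print(rewards.shape)                     # torch.Size([32])
```

In an adversarial training scheme such as AIRL, the context-conditioned reward would play the role of the discriminator's reward term, with the encoder trained jointly so that a single new demonstration suffices to recover the task-specific reward at test time.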
