Strictly Batch Imitation Learning by Energy-based Distribution Matching

Consider learning a policy purely on the basis of demonstrated behavior---that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment. This *strictly batch imitation learning* problem arises wherever live experimentation is costly, such as in healthcare. One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting. But such an approach relies heavily on model estimation or off-policy evaluation, and can be indirect and inefficient. We argue that a good solution should be able to explicitly parameterize a policy (i.e. respecting action conditionals), implicitly account for rollout dynamics (i.e. respecting state marginals), and---crucially---operate in an entirely offline fashion. To meet this challenge, we propose a novel technique by *energy-based distribution matching* (EDM): By identifying parameterizations of the (discriminative) model of a policy with the (generative) energy function for state distributions, EDM provides a simple and effective solution that equivalently minimizes a divergence between the occupancy measures of the demonstrator and the imitator. Through experiments on control tasks and healthcare settings, we illustrate consistent performance gains over existing algorithms for strictly batch imitation learning.
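To make the identification described above concrete, the sketch below ties a discrete-action policy's logits to a state energy function and trains both from demonstrations alone: the same network is fit discriminatively on demonstrated (state, action) pairs and generatively on the demonstrated state marginal. This is a minimal PyTorch sketch under assumed names (`PolicyEnergyModel`, `sgld_sample`, `edm_loss`) with illustrative hyperparameters, not the authors' implementation; in particular, the SGLD negative sampler and the contrastive energy gap weighted by `alpha` are stand-ins for the paper's actual surrogate objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyEnergyModel(nn.Module):
    """Discrete-action policy whose logits double as a state energy function."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def logits(self, s: torch.Tensor) -> torch.Tensor:
        # f(s): unnormalized action log-probabilities, pi(a|s) = softmax(f(s)).
        return self.net(s)

    def energy(self, s: torch.Tensor) -> torch.Tensor:
        # E(s) = -logsumexp_a f(s)[a]; low energy corresponds to high state density.
        return -torch.logsumexp(self.logits(s), dim=-1)


def sgld_sample(model, init_s, steps=20, step_size=0.01, noise=0.005):
    # Approximate samples from the model's state distribution via stochastic
    # gradient Langevin dynamics (hyperparameters here are illustrative only).
    s = init_s.clone().detach().requires_grad_(True)
    for _ in range(steps):
        grad = torch.autograd.grad(model.energy(s).sum(), s)[0]
        s = (s - step_size * grad + noise * torch.randn_like(s)).detach().requires_grad_(True)
    return s.detach()


def edm_loss(model, states, actions, alpha=1.0):
    # Discriminative term: behavioural cloning on demonstrated action conditionals.
    bc = F.cross_entropy(model.logits(states), actions)
    # Generative term: contrastive energy gap between demonstrated states and
    # negatives sampled from the model, nudging the imitator's implied state
    # occupancy toward the demonstrator's.
    neg = sgld_sample(model, states + 0.1 * torch.randn_like(states))
    gap = model.energy(states).mean() - model.energy(neg).mean()
    return bc + alpha * gap


# Example usage on a synthetic batch of demonstrations (for illustration only).
if __name__ == "__main__":
    model = PolicyEnergyModel(state_dim=4, n_actions=2)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    demo_s, demo_a = torch.randn(256, 4), torch.randint(0, 2, (256,))
    for _ in range(10):
        opt.zero_grad()
        loss = edm_loss(model, demo_s, demo_a)
        loss.backward()
        opt.step()
```

Note that training stays entirely offline: both loss terms are computed from the demonstration batch and the model itself, with no environment interaction.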
