Imitation Learning via Kernel Mean Embedding

Abstract

Imitation learning refers to the problem where an agent learns a policy that mimics the demonstrations provided by an expert, without any information about the cost function of the environment. Classical approaches to imitation learning usually rely on a restrictive class of cost functions that best explains the expert's demonstrations, exemplified by linear functions of pre-defined features on states and actions. We show that the kernelization of a classical algorithm naturally reduces imitation learning to a distribution learning problem, in which the imitation policy tries to match the state-action visitation distribution of the expert. Closely related to our approach is the recent work on leveraging generative adversarial networks (GANs) for imitation learning, but our reduction to distribution learning is much simpler, robust to scarce expert demonstrations, and sample efficient. We demonstrate the effectiveness of our approach on a wide range of high-dimensional control tasks.

Introduction

In imitation learning, an agent learns to behave by mimicking the demonstrations provided by an expert, situated in an environment with an unknown cost function. A classical approach to imitation learning is behavioral cloning, where a policy mapping states to actions is learned directly by supervised learning (Sammut 2010). Unfortunately, this straightforward approach does not generalize well to unseen states and often requires a large amount of training data. A more principled approach is apprenticeship learning (AL), which seeks a policy that is guaranteed to perform at least as well as the expert (Russell 1998; Ng and Russell 2000; Abbeel and Ng 2004). However, to formally meet this guarantee, AL algorithms typically assume a restrictive class of cost functions and a planner that yields a sufficiently accurate optimal policy for any given cost function. This does not reflect the complex nature of high-dimensional dynamics in real-world problems.

On the other hand, deep neural networks have shown strong predictive power for modeling complex functions: the parametric functions they represent are highly flexible and expressive, and can be trained efficiently by stochastic gradient descent. Representing the cost function and the agent's policy with neural networks should therefore yield plausible policies that faithfully imitate the expert's demonstrated behaviors in high-dimensional control tasks. In this line of thought, Ho and Ermon (2016) presented generative adversarial imitation learning (GAIL), which casts the objective of imitation learning as the training objective of generative adversarial networks (GANs) (Goodfellow et al. 2014). The key insight behind GAIL is that imitation learning reduces to matching the state-action visitation distribution (i.e., the occupancy measure) of the learned policy to that of the expert policy, under a suitable choice of penalty on the cost function. However, GAIL often exhibits unstable training in practice because the minimax objective is addressed by alternating optimization of the generator and discriminator networks, a well-known challenge in training GANs.

In this work, we show that extending the class of cost functions to a reproducing kernel Hilbert space (RKHS) alternatively reduces imitation learning to a distribution learning problem under the maximum mean discrepancy (MMD), a metric on probability distributions defined in the RKHS.
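To make the MMD concrete, the following is a minimal sketch (not the implementation used in this work) of the standard biased empirical estimator of the squared MMD between expert and imitation state-action samples. The Gaussian RBF kernel, the bandwidth parameter sigma, the helper names rbf_kernel and mmd2, and the convention of concatenating each state-action pair into a single vector are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2(expert_sa, policy_sa, sigma=1.0):
    """Biased empirical estimate of the squared MMD between two sample sets.

    Each row of expert_sa / policy_sa is a concatenated [state, action] vector
    drawn from the expert's or the imitation policy's visitation distribution.
    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)].
    """
    k_xx = rbf_kernel(expert_sa, expert_sa, sigma)
    k_yy = rbf_kernel(policy_sa, policy_sa, sigma)
    k_xy = rbf_kernel(expert_sa, policy_sa, sigma)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()
```

Driving this quantity toward zero over samples generated by the imitation policy is exactly the occupancy-measure matching that GAIL pursues through its adversarial objective.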
However, our derivation is much simpler and more natural. Although the derivation is almost immediate, our work is the first to present the kernelization of a classical AL algorithm (Abbeel and Ng 2004) and to establish analogies with the state-of-the-art imitation learning algorithm, GAIL. The advantage of our approach is that training becomes simpler yet robust to local optima, since the hard minimax optimization is avoided. As a result, our work becomes closely related to generative moment matching networks (GMMNs) (Li, Swersky, and Zemel 2015) and MMD nets (Dziugaite, Roy, and Ghahramani 2015), two approaches to training deep generative neural networks using the MMD. Our experiments on the same set of high-dimensional control imitation tasks, with settings identical to those in the GAIL paper and with the largest task involving 376 observation dimensions and 17 action dimensions, demonstrate that the proposed approach performs better than or on a par with GAIL, and significantly outperforms GAIL when expert demonstrations are scarce, with a performance gain of up to 41%.

Background

MDPs and Imitation Learning

We define basic notation for our problem setting and briefly review relevant work in the literature. We assume learning in an environment that can be modeled as a Markov decision process (MDP).
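To ground the classical ingredient mentioned above, the following is a minimal sketch of the discounted feature expectations that apprenticeship learning (Abbeel and Ng 2004) matches between policy and expert. The feature map phi, the discount factor gamma, and the trajectory format (a list of (state, action) pairs) are illustrative assumptions rather than the notation used in this paper; informally, replacing the fixed feature map with the implicit feature map of an RKHS kernel turns this feature-matching distance into the MMD between occupancy measures, which is the reduction described in the introduction.

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.99):
    """Empirical discounted feature expectations of a policy,
    mu(pi) ~= (1/N) * sum_i sum_t gamma^t * phi(s_t, a_t),
    estimated from N sampled trajectories.

    trajectories: list of trajectories, each a list of (state, action) pairs.
    phi: pre-defined feature map taking (state, action) to a fixed-length vector.
    """
    mu = 0.0
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            mu = mu + (gamma ** t) * np.asarray(phi(s, a), dtype=float)
    return mu / len(trajectories)

def feature_matching_gap(mu_policy, mu_expert):
    """Distance that classical apprenticeship learning drives toward zero."""
    return np.linalg.norm(mu_policy - mu_expert)
```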

References

[1] Andreas Christmann, et al. Support Vector Machines, 2008, Data Mining and Knowledge Discovery Handbook.

[2] Alex Smola, et al. Optimized Maximum Mean Discrepancy, 2016.

[3] Claude Sammut. Behavioral Cloning, 2017, Encyclopedia of Machine Learning and Data Mining.

[4] Ian J. Goodfellow, et al. Generative Adversarial Nets, 2014, NIPS.

[5] Bernhard Schölkopf, et al. A Kernel Two-Sample Test, 2012, Journal of Machine Learning Research.

[6] A. Müller. Integral Probability Metrics and Their Generating Classes of Functions, 1997, Advances in Applied Probability.

[7] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[8] Bernhard Schölkopf, et al. Injective Hilbert Space Embeddings of Probability Measures, 2008, COLT.

[9] Yuval Tassa, et al. MuJoCo: A Physics Engine for Model-Based Control, 2012, IEEE/RSJ International Conference on Intelligent Robots and Systems.

[10] Barnabás Póczos, et al. On the Decreasing Power of Kernel and Distance Based Nonparametric Hypothesis Tests in High Dimensions, 2014, AAAI.

[11] Stuart J. Russell. Learning Agents for Uncertain Environments (Extended Abstract), 1998, COLT.

[12] Jonathan Ho, Stefano Ermon. Generative Adversarial Imitation Learning, 2016, NIPS.

[13] Gintare Karolina Dziugaite, Daniel M. Roy, Zoubin Ghahramani. Training Generative Neural Networks via Maximum Mean Discrepancy Optimization, 2015, UAI.

[14] Michael Reed, Barry Simon. Methods of Modern Mathematical Physics, Academic Press.

[15] Wojciech Zaremba, et al. OpenAI Gym, 2016, arXiv.

[16] Andrew Y. Ng, Stuart J. Russell. Algorithms for Inverse Reinforcement Learning, 2000, ICML.

[17] Pieter Abbeel, Andrew Y. Ng. Apprenticeship Learning via Inverse Reinforcement Learning, 2004, ICML.

[18] Bernhard Schölkopf, Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, 2002, MIT Press.

[19] Michael I. Jordan, et al. Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces, 2004, Journal of Machine Learning Research.

[20] Léon Bottou, et al. Wasserstein Generative Adversarial Networks, 2017, ICML.

[21] Robert E. Schapire, et al. A Game-Theoretic Approach to Apprenticeship Learning, 2007, NIPS.

[22] Yujia Li, Kevin Swersky, Richard S. Zemel. Generative Moment Matching Networks, 2015, ICML.

[23] Sebastian Nowozin, et al. f-GAN: Training Generative Neural Samplers Using Variational Divergence Minimization, 2016, NIPS.

[24] Stefano Ermon, et al. Model-Free Imitation Learning with Policy Optimization, 2016, ICML.

[25] Le Song, et al. A Hilbert Space Embedding for Distributions, 2007, Discovery Science.

[26] Ferenc Huszar, et al. How (not) to Train Your Generative Model: Scheduled Sampling, Likelihood, Adversary?, 2015, arXiv.

[27] Sergey Levine, et al. Trust Region Policy Optimization, 2015, ICML.