Hybrid Adversarial Inverse Reinforcement Learning

In this paper, we investigate inverse reinforcement learning (IRL), in particular beyond-demonstrator (BD) IRL. BD-IRL aims not only to imitate the expert policy but also to extrapolate a beyond-demonstrator policy from a finite set of expert demonstrations. Most existing BD-IRL algorithms are two-stage: they first infer a reward function and then learn a policy via reinforcement learning (RL). Because these two procedures are separate, two-stage algorithms suffer from high computational complexity and lack robustness. To overcome these flaws, we propose a BD-IRL framework, hybrid adversarial inverse reinforcement learning (HAIRL), which integrates imitation and exploration into a single procedure. Simulation results show that HAIRL is more efficient and robust than comparable state-of-the-art (SOTA) algorithms. A minimal sketch of such a single-procedure hybrid loop is given below.
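The following sketch is not the authors' implementation; it only illustrates, under assumed design choices, how imitation and exploration can be folded into one training loop rather than two stages. A logistic discriminator supplies a GAIL/AIRL-style imitation reward, a count-based novelty term supplies the exploration bonus, and a REINFORCE update consumes their sum; the toy chain MDP, feature map, and learning rates are all hypothetical.

```python
"""Minimal single-procedure hybrid adversarial IRL sketch on a toy chain MDP (illustrative only)."""
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, HORIZON = 5, 2, 8   # chain MDP: action 1 moves right, action 0 moves left

def step(s, a):
    return min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)

def rollout(policy_logits):
    """Sample one trajectory of (state, action) pairs from a softmax policy."""
    s, traj = 0, []
    for _ in range(HORIZON):
        p = np.exp(policy_logits[s]); p /= p.sum()
        a = rng.choice(N_ACTIONS, p=p)
        traj.append((s, a))
        s = step(s, a)
    return traj

def features(s, a):
    """One-hot state-action feature for the linear discriminator."""
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

# "Expert" demonstrations: always move right (the behaviour to imitate).
expert = [(min(t, N_STATES - 1), 1) for t in range(HORIZON)]

policy_logits = np.zeros((N_STATES, N_ACTIONS))   # learner policy parameters
w = np.zeros(N_STATES * N_ACTIONS)                # discriminator weights
visit_counts = np.ones((N_STATES, N_ACTIONS))     # for the count-based exploration bonus

for it in range(300):
    traj = rollout(policy_logits)

    # 1) Discriminator update (imitation signal): expert pairs -> 1, policy pairs -> 0.
    for (s, a), label in [(p, 1.0) for p in expert] + [(p, 0.0) for p in traj]:
        x = features(s, a)
        d = 1.0 / (1.0 + np.exp(-w @ x))
        w += 0.1 * (label - d) * x                # logistic-regression gradient step

    # 2) Hybrid reward = imitation reward + exploration bonus, computed in the same loop.
    rewards = []
    for (s, a) in traj:
        visit_counts[s, a] += 1
        d = 1.0 / (1.0 + np.exp(-w @ features(s, a)))
        r_imit = np.log(d + 1e-8) - np.log(1 - d + 1e-8)   # GAIL/AIRL-style reward
        r_expl = 1.0 / np.sqrt(visit_counts[s, a])          # count-based novelty bonus
        rewards.append(r_imit + r_expl)

    # 3) Policy update via REINFORCE on the hybrid reward (return-to-go, no baseline).
    returns = np.cumsum(rewards[::-1])[::-1]
    for (s, a), G in zip(traj, returns):
        p = np.exp(policy_logits[s]); p /= p.sum()
        grad = -p; grad[a] += 1.0
        policy_logits[s] += 0.01 * G * grad

print("learned right-move probability per state:",
      np.round(np.exp(policy_logits)[:, 1] / np.exp(policy_logits).sum(axis=1), 2))
```

The point of the sketch is structural: the reward signal (discriminator plus exploration bonus) is re-estimated inside the same loop that updates the policy, so there is no separate reward-inference stage followed by a standalone RL run, which is the contrast the abstract draws with two-stage BD-IRL methods.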
