Latent Context Based Soft Actor-Critic

The performance of deep reinforcement learning methods is prone to degrade when applied to tasks that require relatively long-horizon memory or that have highly variable dynamics. In this paper, we utilize probabilistic latent context variables, motivated by recent meta-RL work, and propose the Latent Context based Soft Actor-Critic (LC-SAC) approach to address the aforementioned issues. The latent context encodes information about both the agent's previous behaviors and the dynamics of the current environment, which is empirically believed to benefit efficient policy optimization. Experimental results demonstrate that LC-SAC achieves performance comparable to SAC on a collection of continuous control benchmarks and outperforms SAC on tasks with the above two characteristics. Moreover, we introduce a simple but general procedure for integrating LC-SAC with diverse-quality demonstrations, enabling efficient reuse of human prior knowledge and ultimately achieving competitive performance with a comparatively small number of environment interactions.
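To make the idea concrete, the sketch below shows one plausible way to realize a latent-context-conditioned actor in PyTorch. It is an illustrative assumption, not the paper's exact architecture: the module names (ContextEncoder, ContextConditionedActor), the GRU-based encoder, and all dimensions are hypothetical; only the overall structure, a probabilistic context variable z inferred from recent transitions and fed alongside the state into a SAC-style squashed Gaussian policy, follows the abstract.

```python
# Illustrative sketch only: the abstract does not specify LC-SAC's exact
# architecture, so all module names and hyperparameters here are assumptions.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Encodes a window of (state, action, reward) tuples into a latent context z."""
    def __init__(self, state_dim, action_dim, latent_dim, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(state_dim + action_dim + 1, hidden_dim, batch_first=True)
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.log_std = nn.Linear(hidden_dim, latent_dim)

    def forward(self, transitions):
        # transitions: (batch, window, state_dim + action_dim + 1)
        _, h = self.gru(transitions)
        h = h.squeeze(0)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-5, 2)
        # Reparameterized sample of the probabilistic context variable z.
        z = mu + log_std.exp() * torch.randn_like(mu)
        return z, mu, log_std

class ContextConditionedActor(nn.Module):
    """SAC-style squashed Gaussian policy conditioned on (state, latent context)."""
    def __init__(self, state_dim, latent_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Linear(hidden_dim, action_dim)

    def forward(self, state, z):
        h = self.net(torch.cat([state, z], dim=-1))
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        # tanh-squashed reparameterized action, as in standard SAC.
        return torch.tanh(mu + log_std.exp() * torch.randn_like(mu))

# Shape check: 8 windows of 20 transitions (hypothetical dimensions).
enc = ContextEncoder(state_dim=17, action_dim=6, latent_dim=5)
actor = ContextConditionedActor(state_dim=17, latent_dim=5, action_dim=6)
z, _, _ = enc(torch.randn(8, 20, 17 + 6 + 1))
action = actor(torch.randn(8, 17), z)  # -> (8, 6)
```

Conditioning the critic on the same z and training the encoder jointly with the SAC losses would be the natural completion of this sketch; the abstract does not say which objective supervises the encoder.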

[1] Sergey Levine, et al. Probabilistic Model-Agnostic Meta-Learning, 2018, NeurIPS.

[2] Andrew Y. Ng, et al. Algorithms for Inverse Reinforcement Learning, 2000, ICML.

[3] Jürgen Schmidhuber, et al. Solving Deep Memory POMDPs with Recurrent Policy Gradients, 2007, ICANN.

[4] Sergey Levine, et al. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, 2018, ICML.

[5] Sergey Levine, et al. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables, 2019, ICML.

[6] Yuval Tassa, et al. Continuous control with deep reinforcement learning, 2015, ICLR.

[7] Sergey Levine, et al. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, 2017, ICML.

[8] James S. Albus. Brains, Behavior, and Robotics, 1981, Byte Books.

[9] Yuval Tassa, et al. MuJoCo: A physics engine for model-based control, 2012, IEEE/RSJ International Conference on Intelligent Robots and Systems.

[10] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[11] Peter Stone, et al. Transfer Learning for Reinforcement Learning Domains: A Survey, 2009, J. Mach. Learn. Res.

[12] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[13] Wojciech Zaremba, et al. OpenAI Gym, 2016, ArXiv.

[14] Rémi Munos, et al. Recurrent Experience Replay in Distributed Reinforcement Learning, 2018, ICLR.

[15] J. Andrew Bagnell, et al. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy, 2010.

[16] Alec Radford, et al. Proximal Policy Optimization Algorithms, 2017, ArXiv.

[17] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[18] Sergey Levine, et al. Meta-Reinforcement Learning of Structured Exploration Strategies, 2018, NeurIPS.

[19] David Silver, et al. Memory-based control with recurrent neural networks, 2015, ArXiv.

[20] Peter Stone, et al. Deep Recurrent Q-Learning for Partially Observable MDPs, 2015, AAAI Fall Symposia.

[21] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, MIT Press.

[22] Karol Hausman, et al. Learning an Embedding Space for Transferable Robot Skills, 2018, ICLR.

[23] Prabhat Nagarajan, et al. Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations, 2019, ICML.

[24] Brett Browning, et al. A survey of robot learning from demonstration, 2009, Robotics Auton. Syst.