Hybrid Actor-Critic Reinforcement Learning in Parameterized Action Space

In this paper we propose a hybrid architecture of actor-critic algorithms for reinforcement learning in parameterized action space, which consists of multiple parallel sub-actor networks to decompose the structured action space into simpler action spaces along with a critic network to guide the training of all sub-actor networks. While this paper is mainly focused on parameterized action space, the proposed architecture, which we call hybrid actor-critic, can be extended for more general action spaces which has a hierarchical structure. We present an instance of the hybrid actor-critic architecture based on proximal policy optimization (PPO), which we refer to as hybrid proximal policy optimization (H-PPO). Our experiments test H-PPO on a collection of tasks with parameterized action space, where H-PPO demonstrates superior performance over previous methods of parameterized action reinforcement learning.

[1]  Yuval Tassa,et al.  Learning Continuous Control Policies by Stochastic Value Gradients , 2015, NIPS.

[2]  Alex Graves,et al.  Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[3]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[4]  Guy Lever,et al.  Deterministic Policy Gradient Algorithms , 2014, ICML.

[5]  Tapani Raiko,et al.  International Conference on Learning Representations (ICLR) , 2016 .

[6]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[7]  Jan Peters,et al.  Reinforcement learning in robotics: A survey , 2013, Int. J. Robotics Res..

[8]  Matthew Hausknecht and Peter Stone,et al.  Half Field Offense: An Environment for Multiagent Learning and Ad Hoc Teamwork , 2016 .

[9]  Tom Schaul,et al.  Dueling Network Architectures for Deep Reinforcement Learning , 2015, ICML.

[10]  Peter Stone,et al.  Function Approximation via Tile Coding: Automating Parameter Choice , 2005, SARA.

[11]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[12]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[13]  Sergey Levine,et al.  Trust Region Policy Optimization , 2015, ICML.

[14]  Qing Wang,et al.  Parametrized Deep Q-Networks Learning: Reinforcement Learning with Discrete-Continuous Hybrid Action Space , 2018, ArXiv.

[15]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[16]  Demis Hassabis,et al.  Mastering the game of Go without human knowledge , 2017, Nature.