Induced Exploration on Policy Gradients by Increasing Actor Entropy Using Advantage Target Regions

We propose a policy gradient actor-critic algorithm with a built-in exploration mechanism. Unlike existing policy gradient methods that run several actors asynchronously for exploration, our algorithm uses only a single actor that can robustly search for the optimal path. It relies on modified advantage targets that increase the entropy of the actor’s predicted advantage probability distribution. We construct these targets in two steps. The first step expands the advantage targets from points to regions by sampling particles in neighborhoods along the direction of the critic value function; this increases the entropy of the actor’s estimates and explicitly induces the actor to take actions outside of past policies for exploration. The second step controls the variance increase caused by sampling: shortest-path dynamic programming selects the particles from each region that minimize inter-state movement. We present an analysis of our method, compare it with another exploration-based policy gradient algorithm, A3C, and report faster convergence on some VizDoom and Atari benchmarks given the same number of backpropagation steps on a deep network function approximator.
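To make the two-step construction concrete, the sketch below illustrates one plausible reading of the abstract, not the paper's exact procedure. The function names (expand_advantage_targets, select_min_movement_path), the particle count, the neighborhood radius, and the use of an absolute-difference movement cost are all illustrative assumptions; only the overall structure (particle sampling along the critic's direction, then shortest-path dynamic programming over the sampled regions) follows the description above.

```python
import numpy as np

def expand_advantage_targets(advantages, values, num_particles=8, radius=0.1, rng=None):
    """Step 1 (sketch): turn each scalar advantage target into a small region by
    sampling particles in a neighborhood shifted along the sign of the critic value.
    The spread of particles is what raises entropy in the actor's advantage targets.
    num_particles and radius are assumed hyperparameters, not taken from the paper."""
    rng = np.random.default_rng() if rng is None else rng
    direction = np.sign(values)                                   # direction given by the critic value function
    offsets = rng.uniform(0.0, radius, size=(len(advantages), num_particles))
    return advantages[:, None] + direction[:, None] * offsets     # shape (T, num_particles)

def select_min_movement_path(particle_regions):
    """Step 2 (sketch): pick one particle per time step with shortest-path dynamic
    programming, minimizing the total step-to-step movement |a_{t+1} - a_t| so the
    variance introduced by sampling stays controlled."""
    T, K = particle_regions.shape
    cost = np.zeros((T, K))                  # cost[t, k]: best cumulative movement ending at particle k
    back = np.zeros((T, K), dtype=int)       # back-pointers for recovering the selected path
    for t in range(1, T):
        step = np.abs(particle_regions[t][None, :] - particle_regions[t - 1][:, None])
        total = cost[t - 1][:, None] + step  # (prev particle, current particle) transition costs
        back[t] = np.argmin(total, axis=0)
        cost[t] = np.min(total, axis=0)
    # trace back the minimum-movement selection, one particle per time step
    idx = np.empty(T, dtype=int)
    idx[-1] = int(np.argmin(cost[-1]))
    for t in range(T - 1, 0, -1):
        idx[t - 1] = back[t, idx[t]]
    return particle_regions[np.arange(T), idx]

# Usage sketch: expand point targets into regions, then select a low-variance path.
advantages = np.array([0.5, -0.2, 0.3, 0.1])
values = np.array([1.2, -0.4, 0.8, 0.6])
regions = expand_advantage_targets(advantages, values)
targets = select_min_movement_path(regions)   # modified advantage targets fed to the actor update
```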