Induced Exploration on Policy Gradients by Increasing Actor Entropy Using Advantage Target Regions

We propose a policy gradient actor-critic algorithm with a built-in exploration mechanism. Unlike existing policy gradient methods that run several actors asynchronously for exploration, our algorithm uses only a single actor that can robustly search for the optimal path. It relies on modified advantage targets that increase the entropy of the actor’s predicted advantage probability distribution. We construct these targets in two steps. The first step expands the advantage targets from points to regions by sampling particles in neighborhoods along the direction of the critic value function; this increases the entropy of the actor’s estimates and explicitly induces the actor to take actions outside of past policies for exploration. The second step controls the variance increase caused by sampling: shortest-path dynamic programming selects the particles from each region that minimize inter-state movement. We present an analysis of our method, compare it with another exploration-based policy gradient algorithm, A3C, and report faster convergence on some VizDoom and Atari benchmarks given the same number of backpropagation steps on a deep network function approximator.
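To make the two-step construction concrete, the sketch below illustrates one plausible reading of the abstract, not the paper's exact procedure. The function names (expand_advantage_targets, select_min_movement_path), the particle count, the neighborhood radius, and the use of an absolute-difference movement cost are all illustrative assumptions; only the overall structure (particle sampling along the critic's direction, then shortest-path dynamic programming over the sampled regions) follows the description above.

```python
import numpy as np

def expand_advantage_targets(advantages, values, num_particles=8, radius=0.1, rng=None):
    """Step 1 (sketch): turn each scalar advantage target into a small region by
    sampling particles in a neighborhood shifted along the sign of the critic value.
    The spread of particles is what raises entropy in the actor's advantage targets.
    num_particles and radius are assumed hyperparameters, not taken from the paper."""
    rng = np.random.default_rng() if rng is None else rng
    direction = np.sign(values)                                   # direction given by the critic value function
    offsets = rng.uniform(0.0, radius, size=(len(advantages), num_particles))
    return advantages[:, None] + direction[:, None] * offsets     # shape (T, num_particles)

def select_min_movement_path(particle_regions):
    """Step 2 (sketch): pick one particle per time step with shortest-path dynamic
    programming, minimizing the total step-to-step movement |a_{t+1} - a_t| so the
    variance introduced by sampling stays controlled."""
    T, K = particle_regions.shape
    cost = np.zeros((T, K))                  # cost[t, k]: best cumulative movement ending at particle k
    back = np.zeros((T, K), dtype=int)       # back-pointers for recovering the selected path
    for t in range(1, T):
        step = np.abs(particle_regions[t][None, :] - particle_regions[t - 1][:, None])
        total = cost[t - 1][:, None] + step  # (prev particle, current particle) transition costs
        back[t] = np.argmin(total, axis=0)
        cost[t] = np.min(total, axis=0)
    # trace back the minimum-movement selection, one particle per time step
    idx = np.empty(T, dtype=int)
    idx[-1] = int(np.argmin(cost[-1]))
    for t in range(T - 1, 0, -1):
        idx[t - 1] = back[t, idx[t]]
    return particle_regions[np.arange(T), idx]

# Usage sketch: expand point targets into regions, then select a low-variance path.
advantages = np.array([0.5, -0.2, 0.3, 0.1])
values = np.array([1.2, -0.4, 0.8, 0.6])
regions = expand_advantage_targets(advantages, values)
targets = select_min_movement_path(regions)   # modified advantage targets fed to the actor update
```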