Discrete Action On-Policy Learning with Action-Value Critic

Reinforcement learning (RL) in discrete action spaces is ubiquitous in real-world applications, but its complexity grows exponentially with the dimension of the action space, making it challenging to apply existing on-policy gradient-based deep RL algorithms efficiently. To operate effectively in multidimensional discrete action spaces, we construct a critic to estimate action-value functions, apply it to correlated actions, and combine these critic-estimated action values to control the variance of the gradient estimator. We follow a rigorous statistical analysis to design how the correlated actions are generated and combined, and how the gradients are sparsified by shutting off the contributions from certain dimensions. These efforts result in a new discrete-action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance-control techniques. We demonstrate these properties on OpenAI Gym benchmark tasks, and illustrate how discretizing the action space can benefit the exploration phase and hence facilitate convergence to a better local optimum thanks to the flexibility of a discrete policy.
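To make the variance-control idea above concrete, below is a minimal PyTorch sketch (the class and function names, e.g. `DiscreteActorCritic`, are illustrative and not from the paper): a factorized categorical policy samples several actions per state, an action-value critic scores each sampled action, and a leave-one-out average of those critic values serves as a baseline for the REINFORCE-style gradient. The paper's specific construction of correlated actions and its gradient-sparsification rule are not reproduced here; this is only a sketch of using critic-estimated action values of multiple samples as a control variate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteActorCritic(nn.Module):
    """Illustrative actor-critic for a multidimensional discrete action space."""

    def __init__(self, obs_dim, action_dims):
        super().__init__()
        self.action_dims = action_dims  # e.g. [11, 11] for two discretized dimensions
        # one categorical head per action dimension (factorized policy)
        self.policy_logits = nn.ModuleList(
            [nn.Linear(obs_dim, k) for k in action_dims])
        # critic Q(s, a): takes the state and a one-hot encoding of the joint action
        self.q_net = nn.Linear(obs_dim + sum(action_dims), 1)

    def q_value(self, obs, actions):
        one_hots = [F.one_hot(a, k).float()
                    for a, k in zip(actions, self.action_dims)]
        return self.q_net(torch.cat([obs] + one_hots, dim=-1)).squeeze(-1)

    def policy_loss(self, obs, n_samples=4):
        # Draw several actions per state, score each with the critic, and use a
        # leave-one-out average of the critic values as a baseline so that the
        # log-prob-weighted gradient has lower variance.
        log_probs, q_vals = [], []
        for _ in range(n_samples):
            actions, lp = [], 0.0
            for head in self.policy_logits:
                dist = torch.distributions.Categorical(logits=head(obs))
                a = dist.sample()
                lp = lp + dist.log_prob(a)
                actions.append(a)
            log_probs.append(lp)
            q_vals.append(self.q_value(obs, actions))
        log_probs = torch.stack(log_probs)                      # (n_samples, batch)
        q_vals = torch.stack(q_vals)                            # (n_samples, batch)
        total = q_vals.sum(dim=0, keepdim=True)
        baseline = (total - q_vals) / (n_samples - 1)           # leave-one-out baseline
        advantages = (q_vals - baseline).detach()
        return -(log_probs * advantages).mean()
```

The leave-one-out baseline keeps this multi-sample estimator unbiased while still reusing the critic's action-value estimates to damp the gradient variance, which is the general mechanism the abstract refers to.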
