Quantile Credit Assignment

In reinforcement learning, the credit assignment problem is to distinguish luck from skill, that is, to separate the inherent randomness of the environment from the controllable effects of the agent’s actions. This paper proposes two novel algorithms, Quantile Credit Assignment (QCA) and Hindsight QCA (HQCA), which incorporate distributional value estimation to perform credit assignment. QCA uses a network that predicts the quantiles of the return distribution, while HQCA additionally incorporates information about the future. Both QCA and HQCA have the appealing interpretation of leveraging an estimate of the quantile level of the return (interpreted as the level of “luck”) to derive a “luck-dependent” baseline for policy gradient methods. We show theoretically that this approach yields an unbiased policy gradient estimator that can achieve significant variance reductions over a standard value-estimate baseline. QCA and HQCA significantly outperform prior state-of-the-art methods on a range of extremely difficult credit assignment problems.
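
The luck-dependent baseline can be made concrete with a small sketch. The Python snippet below is an illustrative reconstruction, not the paper’s implementation: it assumes a quantile critic has already been trained (for example with the pinball loss shown), estimates the quantile level of the realized return from a separate quantile predictor (standing in for the hindsight network of HQCA), and uses the critic’s quantile at that level as the baseline in a REINFORCE-style update. All function names, constants, and shapes are assumptions made for the example.

import numpy as np

LEVELS = np.linspace(0.05, 0.95, 19)  # fixed quantile levels tau_1, ..., tau_K

def pinball_loss(pred_quantiles, returns, levels=LEVELS):
    # Quantile-regression (pinball) loss one could use to fit the quantile critic.
    diff = returns[:, None] - pred_quantiles[None, :]            # shape (batch, K)
    return float(np.mean(np.maximum(levels * diff, (levels - 1.0) * diff)))

def estimate_luck(observed_return, luck_quantiles, levels=LEVELS):
    # Approximate quantile level ("luck") of the realized return.  In the paper this
    # comes from a learned (optionally future-conditioned) predictor; here we simply
    # invert a given quantile curve empirically.
    frac = np.mean(luck_quantiles <= observed_return)
    return float(np.clip(frac, levels[0], levels[-1]))

def luck_dependent_baseline(luck, critic_quantiles, levels=LEVELS):
    # Evaluate the critic's quantile curve at the estimated luck level.
    return float(np.interp(luck, levels, critic_quantiles))

def reinforce_term(grad_log_pi, observed_return, luck_quantiles, critic_quantiles):
    # Policy-gradient contribution using the luck-dependent baseline in place of V(s).
    luck = estimate_luck(observed_return, luck_quantiles)
    baseline = luck_dependent_baseline(luck, critic_quantiles)
    return (observed_return - baseline) * grad_log_pi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    critic_quantiles = np.sort(rng.normal(1.0, 2.0, size=LEVELS.size))  # stand-in critic output
    luck_quantiles = np.sort(rng.normal(1.0, 2.0, size=LEVELS.size))    # stand-in hindsight output
    g = 2.5                                 # realized return for one episode
    grad_log_pi = np.ones(3)                # stand-in for the gradient of log pi(a|s)
    print("pinball loss:", pinball_loss(critic_quantiles, np.array([g])))
    print("update term:", reinforce_term(grad_log_pi, g, luck_quantiles, critic_quantiles))

A standard value baseline corresponds to using a fixed, luck-independent quantity in place of luck_dependent_baseline; conditioning the baseline on the estimated luck level is what allows the variance reduction described above while keeping the estimator unbiased.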
