Counterfactual Credit Assignment in Model-Free Reinforcement Learning

Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards. In particular, this requires separating skill from luck, i.e., disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to condition value functions on future events, by learning to extract relevant information from a trajectory. We then propose to use these future-conditional value functions as baselines and critics in policy gradient algorithms, and we develop a valid, practical variant with provably lower variance, achieving unbiasedness by constraining the hindsight information to contain no information about the agent's actions. We demonstrate the efficacy and validity of our algorithm on a number of illustrative problems.
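
Below is a minimal PyTorch sketch of this idea, assuming a discrete-action setting: a hindsight summarizer produces Φ_t from the future of the trajectory, the baseline is conditioned on (x_t, Φ_t), and an adversarial probe penalizes any information about the chosen action leaking into Φ_t. The module and function names (HindsightSummarizer, future_conditional_pg_losses, probe) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HindsightSummarizer(nn.Module):
    """Produces Phi_t, a learned summary of the remainder of the trajectory,
    by running a GRU backwards over (observation, reward) pairs."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(obs_dim + 1, hidden)

    def forward(self, obs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
        # obs: [T, obs_dim], rewards: [T]; reverse time so Phi_t depends on the future.
        x = torch.cat([obs, rewards.unsqueeze(-1)], dim=-1).flip(0)
        phi, _ = self.gru(x.unsqueeze(1))          # [T, 1, hidden]
        return phi.flip(0).squeeze(1)              # Phi_t for each step t, [T, hidden]


def future_conditional_pg_losses(policy_logits, actions, returns, obs, phi,
                                 baseline_net, probe, beta: float = 1.0):
    """Policy-gradient loss with a future-conditional baseline b(x_t, Phi_t),
    plus an adversarial probe that tries to predict a_t from Phi_t; penalizing
    its success pushes Phi_t to carry no information about the chosen action."""
    baseline = baseline_net(torch.cat([obs, phi], dim=-1)).squeeze(-1)       # b(x_t, Phi_t)
    logp = F.log_softmax(policy_logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    advantage = (returns - baseline).detach()
    pg_loss = -(advantage * logp).mean()                                      # REINFORCE-style term
    baseline_loss = F.mse_loss(baseline, returns.detach())                    # regress baseline to returns
    probe_loss = F.cross_entropy(probe(phi.detach()), actions)                # trains the probe only
    independence_penalty = -F.cross_entropy(probe(phi), actions)              # trains Phi to defeat the probe
    return pg_loss, baseline_loss, probe_loss, beta * independence_penalty
```

In a training loop one would keep two optimizers: the probe minimizes probe_loss alone, while the policy, baseline, and summarizer minimize pg_loss + baseline_loss + the independence penalty, so that Φ_t is driven to be predictive of returns yet uninformative about the action taken.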
