Counterfactual Credit Assignment in Model-Free Reinforcement Learning

Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards. In particular, this requires separating skill from luck, i.e., disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to condition value functions on future events, by learning to extract relevant information from a trajectory. We then propose to use these future-conditional value functions as baselines and critics in policy gradient algorithms, and we develop a valid, practical variant with provably lower variance, achieving unbiasedness by constraining the hindsight information to contain no information about the agent's actions. We demonstrate the efficacy and validity of our algorithm on a number of illustrative problems.
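
Below is a minimal PyTorch sketch of this idea, assuming a discrete-action setting: a hindsight summarizer produces Φ_t from the future of the trajectory, the baseline is conditioned on (x_t, Φ_t), and an adversarial probe penalizes any information about the chosen action leaking into Φ_t. The module and function names (HindsightSummarizer, future_conditional_pg_losses, probe) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HindsightSummarizer(nn.Module):
    """Produces Phi_t, a learned summary of the remainder of the trajectory,
    by running a GRU backwards over (observation, reward) pairs."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(obs_dim + 1, hidden)

    def forward(self, obs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
        # obs: [T, obs_dim], rewards: [T]; reverse time so Phi_t depends on the future.
        x = torch.cat([obs, rewards.unsqueeze(-1)], dim=-1).flip(0)
        phi, _ = self.gru(x.unsqueeze(1))          # [T, 1, hidden]
        return phi.flip(0).squeeze(1)              # Phi_t for each step t, [T, hidden]


def future_conditional_pg_losses(policy_logits, actions, returns, obs, phi,
                                 baseline_net, probe, beta: float = 1.0):
    """Policy-gradient loss with a future-conditional baseline b(x_t, Phi_t),
    plus an adversarial probe that tries to predict a_t from Phi_t; penalizing
    its success pushes Phi_t to carry no information about the chosen action."""
    baseline = baseline_net(torch.cat([obs, phi], dim=-1)).squeeze(-1)       # b(x_t, Phi_t)
    logp = F.log_softmax(policy_logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    advantage = (returns - baseline).detach()
    pg_loss = -(advantage * logp).mean()                                      # REINFORCE-style term
    baseline_loss = F.mse_loss(baseline, returns.detach())                    # regress baseline to returns
    probe_loss = F.cross_entropy(probe(phi.detach()), actions)                # trains the probe only
    independence_penalty = -F.cross_entropy(probe(phi), actions)              # trains Phi to defeat the probe
    return pg_loss, baseline_loss, probe_loss, beta * independence_penalty
```

In a training loop one would keep two optimizers: the probe minimizes probe_loss alone, while the policy, baseline, and summarizer minimize pg_loss + baseline_loss + the independence penalty, so that Φ_t is driven to be predictive of returns yet uninformative about the action taken.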
