The Advantage Regret-Matching Actor-Critic

Regret minimization has played a key role in online learning, equilibrium computation in games, and reinforcement learning (RL). In this paper, we describe a general model-free RL method for no-regret learning based on repeated reconsideration of past behavior. We propose a model-free RL algorithm, the Advantage Regret-Matching Actor-Critic (ARMAC): rather than saving past state-action data, ARMAC saves a buffer of past policies, replaying through them to reconstruct hindsight assessments of past behavior. These retrospective value estimates are used to predict conditional advantages which, combined with regret matching, produce a new policy. In particular, ARMAC learns from sampled trajectories in a centralized training setting, without requiring the importance sampling commonly used in Monte Carlo counterfactual regret (CFR) minimization; hence, it does not suffer from excessive variance in large environments. In the single-agent setting, ARMAC exhibits an interesting form of exploration by keeping past policies intact. In the multiagent setting, ARMAC in self-play approaches Nash equilibria on some partially observable zero-sum benchmarks. We provide exploitability estimates in the significantly larger game of betting-abstracted no-limit Texas Hold'em.
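The core policy-improvement step described above can be illustrated with a minimal sketch of regret matching applied to learned advantage estimates: action probabilities are set proportional to the positive part of the predicted advantages, with a uniform fallback when no action looks advantageous. The function and variable names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def regret_matching_policy(advantages, legal_actions):
    """Turn per-action advantage estimates for one information state into a
    policy via regret matching: probabilities proportional to the positive
    advantages, or uniform if no action has positive advantage."""
    adv = np.asarray([advantages[a] for a in legal_actions], dtype=np.float64)
    positive = np.maximum(adv, 0.0)   # only positive (regret-like) values matter
    total = positive.sum()
    if total > 0.0:
        probs = positive / total
    else:
        probs = np.full(len(legal_actions), 1.0 / len(legal_actions))
    return dict(zip(legal_actions, probs))

# Hypothetical example: three legal actions with predicted advantages.
policy = regret_matching_policy({0: 1.5, 1: -0.3, 2: 0.5}, legal_actions=[0, 1, 2])
# -> {0: 0.75, 1: 0.0, 2: 0.25}
```

In the full algorithm, these advantage estimates would come from the critic trained on replays of the stored past policies; the sketch only shows how regret matching converts such estimates into a new policy.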
