The Advantage Regret-Matching Actor-Critic

Regret minimization has played a key role in online learning, equilibrium computation in games, and reinforcement learning (RL). In this paper, we describe a general model-free RL method for no-regret learning based on repeated reconsideration of past behavior. We propose a model-free RL algorithm, the Advantage Regret-Matching Actor-Critic (ARMAC): rather than saving past state-action data, ARMAC saves a buffer of past policies, replaying through them to reconstruct hindsight assessments of past behavior. These retrospective value estimates are used to predict conditional advantages which, combined with regret matching, produce a new policy. In particular, ARMAC learns from sampled trajectories in a centralized training setting, without requiring the importance sampling commonly used in Monte Carlo counterfactual regret (CFR) minimization; hence, it does not suffer from excessive variance in large environments. In the single-agent setting, ARMAC exhibits an interesting form of exploration by keeping past policies intact. In the multiagent setting, ARMAC in self-play approaches Nash equilibria on some partially observable zero-sum benchmarks. We provide exploitability estimates in the significantly larger game of betting-abstracted no-limit Texas Hold'em.
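The core policy-improvement step described above can be illustrated with a minimal sketch of regret matching applied to learned advantage estimates: action probabilities are set proportional to the positive part of the predicted advantages, with a uniform fallback when no action looks advantageous. The function and variable names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def regret_matching_policy(advantages, legal_actions):
    """Turn per-action advantage estimates for one information state into a
    policy via regret matching: probabilities proportional to the positive
    advantages, or uniform if no action has positive advantage."""
    adv = np.asarray([advantages[a] for a in legal_actions], dtype=np.float64)
    positive = np.maximum(adv, 0.0)   # only positive (regret-like) values matter
    total = positive.sum()
    if total > 0.0:
        probs = positive / total
    else:
        probs = np.full(len(legal_actions), 1.0 / len(legal_actions))
    return dict(zip(legal_actions, probs))

# Hypothetical example: three legal actions with predicted advantages.
policy = regret_matching_policy({0: 1.5, 1: -0.3, 2: 0.5}, legal_actions=[0, 1, 2])
# -> {0: 0.75, 1: 0.0, 2: 0.25}
```

In the full algorithm, these advantage estimates would come from the critic trained on replays of the stored past policies; the sketch only shows how regret matching converts such estimates into a new policy.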
