Can Reinforcement Learning Find Stackelberg-Nash Equilibria in General-Sum Markov Games with Myopically Rational Followers?

We study multi-player general-sum Markov games in which one player is designated as the leader and the remaining players are followers. In particular, we focus on the class of games where the followers are myopically rational; i.e., they aim to maximize their instantaneous rewards. For such a game, our goal is to find a Stackelberg-Nash equilibrium (SNE), which is a policy pair (π∗, ν∗) such that: (i) π∗ is the optimal policy for the leader when the followers always play their best response, and (ii) ν∗ is the best-response policy of the followers, i.e., a Nash equilibrium of the followers' game induced by π∗. We develop sample-efficient reinforcement learning (RL) algorithms for finding an SNE in both the online and offline settings. Our algorithms are optimistic and pessimistic variants of least-squares value iteration, and they readily incorporate function approximation to handle large state spaces. Furthermore, in the case of linear function approximation, we prove that our algorithms achieve sublinear regret in the online setting and sublinear suboptimality in the offline setting. To the best of our knowledge, we establish the first provably efficient RL algorithms for finding SNEs in general-sum Markov games with myopically rational followers.
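To make the algorithmic idea concrete, the following is a minimal sketch of a single least-squares value-iteration update with linear features and an uncertainty bonus, in the spirit of the optimistic (online) variant described above; the pessimistic (offline) variant would subtract the bonus instead of adding it. All names here (phi, lam, beta, lsvi_optimistic_update) are illustrative assumptions, not notation or code from the paper.

import numpy as np

def lsvi_optimistic_update(features, targets, phi_query, lam=1.0, beta=1.0):
    """One ridge-regression step with an optimistic exploration bonus.

    features : (n, d) array of feature vectors phi(s_h, a_h) from past episodes
    targets  : (n,)   array of regression targets r_h + V_{h+1}(s_{h+1})
    phi_query: (d,)   feature vector of the (state, action) pair being evaluated
    Returns an optimistic Q-value estimate at the queried pair.
    """
    n, d = features.shape
    # Regularized covariance matrix: Lambda = lam * I + sum_i phi_i phi_i^T
    Lambda = lam * np.eye(d) + features.T @ features
    # Least-squares weights: w = Lambda^{-1} sum_i phi_i * target_i
    w = np.linalg.solve(Lambda, features.T @ targets)
    # Elliptical-confidence bonus: beta * sqrt(phi^T Lambda^{-1} phi)
    bonus = beta * np.sqrt(phi_query @ np.linalg.solve(Lambda, phi_query))
    # Optimism adds the bonus; a pessimistic (offline) update would subtract it.
    return phi_query @ w + bonus

# Example usage with synthetic data (d = 4 features, n = 50 transitions):
rng = np.random.default_rng(0)
q_hat = lsvi_optimistic_update(rng.normal(size=(50, 4)), rng.normal(size=50), rng.normal(size=4))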
