Can Reinforcement Learning Find Stackelberg-Nash Equilibria in General-Sum Markov Games with Myopic Followers?

Peking University. Email: hanzhong@stu.pku.edu.cn
Princeton University. Email: zy6@princeton.edu
Northwestern University. Email: zhaoranwang@gmail.com
UC Berkeley. Email: jordan@cs.berkeley.edu

We study multi-player general-sum Markov games in which one player is designated as the leader and the remaining players are followers. In particular, we focus on the class of games where the followers are myopic, i.e., they aim to maximize their instantaneous rewards. For such a game, our goal is to find a Stackelberg-Nash equilibrium (SNE), which is a policy pair (π∗, ν∗) such that (i) π∗ is the optimal policy for the leader when the followers always play their best response, and (ii) ν∗ is the best-response policy of the followers, i.e., a Nash equilibrium of the followers' game induced by π∗. We develop sample-efficient reinforcement learning (RL) algorithms for finding an SNE in both the online and offline settings. Our algorithms are optimistic and pessimistic variants of least-squares value iteration, and they readily incorporate function approximation to handle large state spaces. Furthermore, in the case of linear function approximation, we prove that our algorithms achieve sublinear regret and suboptimality in the online and offline settings, respectively. To the best of our knowledge, these are the first provably efficient RL algorithms for finding SNEs in general-sum Markov games with myopic followers.
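For concreteness, the SNE condition above can be sketched as a bilevel problem. The LaTeX display below is an illustrative sketch only: the value notation J_ℓ, J_i and the best-response set BR(π) are shorthand introduced here for exposition, not necessarily the paper's notation. Because the followers are myopic, BR(π) is the set of Nash equilibria of the one-step game among the followers induced by the leader's policy π.

% Bilevel sketch of the Stackelberg-Nash equilibrium (SNE); notation is illustrative shorthand:
%   J_\ell(\pi, \nu) -- leader's expected cumulative reward under (\pi, \nu)
%   J_i(\pi, \nu)    -- follower i's instantaneous reward under (\pi, \nu)
%   BR(\pi)          -- Nash equilibria of the followers' one-step game induced by \pi
\begin{align*}
  \mathrm{BR}(\pi)
    &:= \bigl\{ \nu = (\nu_1, \dots, \nu_m) :
        J_i\bigl(\pi, (\nu_i, \nu_{-i})\bigr) \ge J_i\bigl(\pi, (\nu_i', \nu_{-i})\bigr)
        \ \ \forall \nu_i',\ \forall i \bigr\}, \\
  \pi^* &\in \operatorname*{argmax}_{\pi}\; J_\ell\bigl(\pi, \nu^*(\pi)\bigr)
        \quad \text{with } \nu^*(\pi) \in \mathrm{BR}(\pi), \\
  \nu^* &:= \nu^*(\pi^*) \in \mathrm{BR}(\pi^*).
\end{align*}

When BR(π) contains more than one Nash equilibrium, a tie-breaking convention for the followers' response is needed; that detail is left to the paper's formulation and is not specified in this sketch.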
