Near-Optimal Online Egalitarian Learning in General-Sum Repeated Matrix Games

We study two-player general-sum repeated finite games in which each player's rewards are generated from an unknown distribution. Our aim is to find the egalitarian bargaining solution (EBS) of the repeated game, which can yield much higher rewards for both players than their maximin values. Our main contribution is an algorithm that achieves, simultaneously for both players, a high-probability regret bound of order $\mathcal{O}(\sqrt[3]{\ln T}\cdot T^{2/3})$ after any $T$ rounds of play. We show that this upper bound is nearly optimal by proving a lower bound of $\Omega(T^{2/3})$ for any algorithm.
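To make the target solution concept concrete, the following is a minimal sketch of computing an EBS-style point for a known toy matrix game. It assumes a Prisoner's-Dilemma-like payoff matrix (hypothetical numbers, not from the paper), uses pure-strategy maximin values as a simplified disagreement point (the paper's setting uses the full maximin value), and grid-searches convex combinations of joint actions to maximize the smaller of the two players' gains over the disagreement point.

```python
from itertools import product

# Toy 2x2 general-sum game (hypothetical payoffs; rows = player 1's
# actions, columns = player 2's actions).
R1 = [[3.0, 0.0], [5.0, 1.0]]
R2 = [[3.0, 5.0], [0.0, 1.0]]

# Simplified disagreement point: pure-strategy maximin values.
# (The paper's EBS is defined relative to the mixed-strategy maximin.)
d1 = max(min(row) for row in R1)
d2 = max(min(R2[i][j] for i in range(2)) for j in range(2))

# Egalitarian objective: over mixtures of joint actions, maximize the
# minimum gain of the two players above (d1, d2). Here we grid-search
# convex combinations of pairs of pure joint-action payoff vectors.
joint = [(R1[i][j], R2[i][j]) for i in range(2) for j in range(2)]
best_gain, best_point = float("-inf"), None
steps = 1000
for (u1a, u2a), (u1b, u2b) in product(joint, joint):
    for k in range(steps + 1):
        w = k / steps
        u1 = w * u1a + (1 - w) * u1b
        u2 = w * u2a + (1 - w) * u2b
        gain = min(u1 - d1, u2 - d2)
        if gain > best_gain:
            best_gain, best_point = gain, (u1, u2)

print(best_point, best_gain)  # mutual cooperation dominates the maximin point
```

In this toy game the egalitarian point is the mutual-cooperation payoff (3, 3), giving both players a gain of 2 over the disagreement value of 1, which illustrates why targeting the EBS "can yield much higher rewards than the maximin value of both players." The learning problem the paper addresses is harder: the payoff matrices are unknown and must be estimated from stochastic reward observations while controlling regret.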
