Algorithmic Game Theory: Learning, Regret Minimization, and Equilibria

Many situations involve repeatedly making decisions in an uncertain environment: for instance, deciding what route to drive to work each day, or repeated play of a game against an opponent with an unknown strategy. In this chapter we describe learning algorithms with strong guarantees for settings of this type, along with connections to game-theoretic equilibria when all players in a system are simultaneously adapting in such a manner. We begin by presenting algorithms for repeated play of a matrix game with the guarantee that against any opponent they will perform nearly as well as the best fixed action in hindsight (also called the problem of combining expert advice or minimizing external regret). In a zero-sum game, such algorithms are guaranteed to approach or exceed the minimax value of the game, and they even provide a simple proof of the minimax theorem. We then turn to algorithms that minimize an even stronger form of regret, known as internal or swap regret, and present a general reduction showing how to convert any algorithm for minimizing external regret into one that minimizes this stronger form of regret as well. Internal regret is important because when all players in a game minimize it, the empirical distribution of play is known to converge to a correlated equilibrium. The third part of this chapter gives a different reduction: how to convert algorithms for the full-information setting, in which the action chosen by the opponent is revealed after each time step, to the partial-information (bandit) setting, in which only the payoff of the selected action is observed (as in routing), while still maintaining small external regret. Finally, we end by discussing routing games in the Wardrop model, where one can show that if all participants minimize their own external regret, then overall traffic is guaranteed to converge to an approximate Nash equilibrium.
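
To make the external-regret guarantee concrete, here is a minimal Python sketch of the Randomized Weighted Majority (multiplicative-weights) algorithm for the full-information setting. The function name, the epsilon parameter, and the assumption that losses lie in [0, 1] are illustrative choices, not notation from the chapter.

    import random

    def randomized_weighted_majority(n_actions, loss_sequence, epsilon=0.1):
        # One weight per action; all actions start out equally plausible.
        weights = [1.0] * n_actions
        total_loss = 0.0
        for losses in loss_sequence:  # losses[i] in [0, 1] revealed for every action i
            z = sum(weights)
            probs = [w / z for w in weights]
            action = random.choices(range(n_actions), weights=probs)[0]
            total_loss += losses[action]
            # Multiplicative update: shrink each action's weight by its observed loss.
            weights = [w * (1.0 - epsilon) ** l for w, l in zip(weights, losses)]
        return total_loss

With epsilon tuned on the order of sqrt(ln(n_actions) / T) over T rounds, the expected total loss exceeds that of the best fixed action in hindsight by only O(sqrt(T log n_actions)), so the average regret per round vanishes as T grows; this is exactly the "nearly as well as the best fixed action" guarantee described above.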
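
The full-information-to-bandit reduction can likewise be sketched along the lines of the Exp3 algorithm of Auer et al.: mix a small amount of uniform exploration into an exponential-weights learner and feed it importance-weighted payoff estimates, so that each action's payoff estimator stays unbiased even though only the chosen action's payoff is observed. The pull callback and the gamma parameter below are assumptions made for illustration, with payoffs scaled to [0, 1].

    import math
    import random

    def exp3(n_actions, n_rounds, pull, gamma=0.1):
        weights = [1.0] * n_actions
        total_payoff = 0.0
        for t in range(n_rounds):
            z = sum(weights)
            # Exponential-weights distribution mixed with uniform exploration.
            probs = [(1.0 - gamma) * w / z + gamma / n_actions for w in weights]
            action = random.choices(range(n_actions), weights=probs)[0]
            x = pull(t, action)  # only this one payoff is revealed (bandit feedback)
            total_payoff += x
            # Importance weighting: x / probs[action] is an unbiased estimate of the
            # action's payoff, since it is observed with probability probs[action].
            estimate = x / probs[action]
            weights[action] *= math.exp(gamma * estimate / n_actions)
            # Rescale to avoid overflow on long horizons; probs are unchanged.
            m = max(weights)
            weights = [w / m for w in weights]
        return total_payoff

Despite the limited feedback, this style of algorithm still achieves external regret O(sqrt(T * n_actions * log(n_actions))) against the best fixed action, which is the guarantee the reduction in this part of the chapter is after.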
