Gambling in a rigged casino: The adversarial multi-armed bandit problem

In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the expected per-round payoff of our algorithm approaches that of the best arm at the rate O(T^{-1/3}), and we give an improved rate of convergence when the best arm has fairly low payoff. We also consider a setting in which the player has a team of "experts" advising him on which arm to play; here, we give a strategy that will guarantee expected payoff close to that of the best expert. Finally, we apply our result to the problem of learning to play an unknown repeated matrix game against an all-powerful adversary.
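The abstract does not spell out the algorithm itself, but the strategy it describes belongs to the exponential-weights family. Below is a minimal sketch, assuming an Exp3-style update: uniform exploration mixed into an exponential-weights distribution, with importance-weighted estimates standing in for the unobserved payoffs. The function names, the exploration parameter gamma, and the toy adversary are illustrative assumptions, not taken from the paper.

```python
import math
import random

def exp3(K, T, get_reward, gamma=0.1):
    """Sketch of an exponential-weights bandit strategy (Exp3-style).

    K          -- number of arms (slot machines)
    T          -- number of plays
    get_reward -- callback get_reward(t, arm) returning a payoff in [0, 1];
                  in the adversarial setting this may depend arbitrarily on
                  the history (hypothetical stand-in for the adversary)
    gamma      -- exploration rate in (0, 1]
    """
    weights = [1.0] * K
    total = 0.0
    for t in range(T):
        w_sum = sum(weights)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [(1.0 - gamma) * w / w_sum + gamma / K for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        x = get_reward(t, arm)  # only the chosen arm's payoff is observed
        total += x
        # Importance-weighted estimate keeps the update unbiased even though
        # the payoffs of the unplayed arms are never seen.
        x_hat = x / probs[arm]
        weights[arm] *= math.exp(gamma * x_hat / K)
    return total

if __name__ == "__main__":
    # Toy adversary: arm 0 always pays well, the others poorly.
    payoff = exp3(K=5, T=10000, get_reward=lambda t, a: 0.9 if a == 0 else 0.1)
    print(payoff / 10000)  # per-round payoff should approach that of arm 0
```

The mixing term gamma/K keeps every arm's selection probability bounded away from zero, which is what keeps the importance-weighted estimates from blowing up; tuning gamma as a function of T is what yields the stated convergence rate.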
