In the classical stochastic k-armed bandit problem, in each of a sequence of T rounds, a decision maker chooses one of k arms and incurs a cost drawn from an unknown distribution associated with that arm. The goal is to minimize regret, defined as the difference between the cost incurred by the algorithm and the optimal cost. In the linear optimization version of this problem (first considered by Auer [2002]), we view the arms as vectors in R^n, and require the costs to be linear functions of the chosen vector. As before, the cost functions are assumed to be sampled independently from an unknown distribution. In this setting, the goal is to find algorithms whose running time and regret behave well as functions of the number of rounds T and the dimensionality n (rather than the number of arms k, which may be exponential in n or even infinite). We give a nearly complete characterization of this problem in terms of both upper and lower bounds on the regret. In certain special cases (such as when the decision region is a polytope), the regret is polylog(T). In general, however, the optimal regret is Θ*(√T): our lower bounds rule out the possibility of obtaining polylog(T) rates in general. We present two variants of an algorithm based on the idea of "upper confidence bounds." The first, due to Auer [2002] but not fully analyzed there, obtains regret whose dependence on both n and T is essentially optimal, but which may be computationally intractable when the decision set is a polytope. The second variant can be efficiently implemented when the decision set is a polytope (given as an intersection of half-spaces), but gives up a factor of √n in the regret bound. Our results also extend to the setting where the set of allowed decisions may change over time.

∗Department of Computer Science, University of Chicago, varsha@cs.uchicago.edu
†Toyota Technological Institute at Chicago, {hayest,sham}@tti-c.org
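The "upper confidence bound" idea described above can be illustrated with a minimal sketch for a finite decision set: maintain a ridge-regression estimate of the unknown cost vector, and in each round act optimistically by choosing the arm whose confidence interval permits the lowest cost. This is only an illustration of the general principle, not the paper's algorithm; the function name `ucb_select` and the confidence-width parameter `beta` are hypothetical choices.

```python
import numpy as np

def ucb_select(arms, A, b, beta):
    """Pick the arm with the most optimistic (lowest) cost bound.

    arms : list of n-dimensional vectors (the decision set, assumed finite here)
    A    : regularized Gram matrix, I + sum over past rounds of x x^T
    b    : cost-weighted sum of past arms, sum of c_t * x_t
    beta : confidence-width parameter (a hypothetical tuning choice)
    """
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                       # ridge estimate of the cost vector
    best, best_bound = None, np.inf
    for x in arms:
        width = beta * np.sqrt(x @ A_inv @ x)   # confidence width in direction x
        bound = theta_hat @ x - width           # optimistic lower bound on the cost
        if bound < best_bound:
            best, best_bound = x, bound
    return best
```

For example, after several observations of cost 1 on the first basis vector, the rule prefers the unexplored second basis vector, since its wide confidence interval still permits a much lower cost. When the decision set is a polytope rather than a finite list, this per-arm loop is exactly the step that may become intractable, which is the computational issue the abstract raises.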
[1] C. McDiarmid. Concentration. 1998, in Probabilistic Methods for Algorithmic Discrete Mathematics.
[2] Sartaj Sahni, et al. Computationally Related Problems. 1974, SIAM J. Comput.
[3] D. Freedman. On Tail Probabilities for Martingales. 1975, Annals of Probability.
[4] R. Varga, et al. Proof of Theorem 4. 1983.
[5] P. W. Jones, et al. Bandit Problems: Sequential Allocation of Experiments. 1987.
[6] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. 1995, Advances in Applied Probability.
[7] M. Habib. Probabilistic methods for algorithmic discrete mathematics. 1998.
[8] Philip M. Long, et al. Associative Reinforcement Learning using Linear Probabilistic Concepts. 1999, ICML.
[9] Peter Auer, et al. Using Confidence Bounds for Exploitation-Exploration Trade-offs. 2003, J. Mach. Learn. Res.
[10] Peter Auer, et al. Finite-time Analysis of the Multiarmed Bandit Problem. 2002, Machine Learning.
[11] Baruch Awerbuch, et al. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. 2004, STOC '04.
[12] H. Robbins. Some aspects of the sequential design of experiments. 1952, Bulletin of the American Mathematical Society.
[13] Thomas P. Hayes, et al. The Price of Bandit Information for Online Optimization. 2007, NIPS.
[14] Botond Cseke, et al. Advances in Neural Information Processing Systems 20 (NIPS 2007). 2008.
[15] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules. 1985, Advances in Applied Mathematics.