Robbing the bandit: less regret in online geometric optimization against an adaptive adversary

We consider "online bandit geometric optimization," a problem of iterated decision making in a largely unknown and constantly changing environment. The goal is to minimize "regret," defined as the difference between the actual loss of an online decision-making procedure and that of the best single decision in hindsight. "Geometric optimization" refers to a generalization of the well-known multi-armed bandit problem, in which the decision space is some bounded subset of R^d, the adversary is restricted to linear loss functions, and regret bounds should depend on the dimensionality d, rather than the total number of possible decisions. "Bandit" refers to the setting in which the algorithm is only told its loss on each round, rather than the entire loss function. McMahan and Blum [12] presented the best known algorithm in this setting, and proved that its expected additive regret is O(poly(d) T^{3/4}). We simplify and improve their analysis of this algorithm to obtain regret O(poly(d) T^{2/3}). We also prove that, for a large class of full-information online optimization problems, the optimal regret against an adaptive adversary is the same as against a non-adaptive adversary.
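The regret notion above can be made concrete with a short sketch. This is an illustration only (the function and variable names are hypothetical, not from the paper); losses are linear, as in the geometric setting, so the loss of a point x on round t is the inner product of x with a loss vector c_t:

```python
def dot(u, v):
    """Inner product of two vectors in R^d."""
    return sum(a * b for a, b in zip(u, v))

def regret(decision_set, loss_vectors, played):
    """Additive regret: the algorithm's total linear loss minus the total
    loss of the best single fixed decision in hindsight.

    decision_set: candidate fixed points x in the bounded subset of R^d
    loss_vectors: one loss vector c_t per round (loss of x is <c_t, x>)
    played:       the point x_t the algorithm actually chose each round
    """
    algorithm_loss = sum(dot(c, x) for c, x in zip(loss_vectors, played))
    best_in_hindsight = min(
        sum(dot(c, x) for c in loss_vectors) for x in decision_set
    )
    return algorithm_loss - best_in_hindsight

# Two candidate decisions in R^2; the algorithm stubbornly plays the first.
corners = [(1.0, 0.0), (0.0, 1.0)]
losses = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(regret(corners, losses, [(1.0, 0.0)] * 3))  # 2.0 - 1.0 = 1.0
```

In the bandit setting the algorithm would observe only its own scalar loss dot(c_t, x_t) each round, never the full vector c_t, which is what makes sublinear regret bounds such as O(poly(d) T^{2/3}) nontrivial.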

[1]  James Hannan, et al.  Approximation to Bayes Risk in Repeated Play, 1958.

[2]  Manfred K. Warmuth, et al.  The weighted majority algorithm, 1989, 30th Annual Symposium on Foundations of Computer Science.

[3]  Vladimir Vovk, et al.  Aggregating strategies, 1990, COLT '90.

[4]  Nicolò Cesa-Bianchi, et al.  Gambling in a rigged casino: the adversarial multi-armed bandit problem, 1995, Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science.

[5]  Yoav Freund, et al.  A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.

[6]  Yoav Freund, et al.  A decision-theoretic generalization of on-line learning and an application to boosting, 1997, J. Comput. Syst. Sci.

[7]  Y. Freund, et al.  The non-stochastic multi-armed bandit problem, 2001.

[8]  Peter Auer, et al.  The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput.

[9]  Santosh S. Vempala, et al.  Efficient algorithms for online decision problems, 2005, J. Comput. Syst. Sci.

[10]  Martin Zinkevich, et al.  Online Convex Programming and Generalized Infinitesimal Gradient Ascent, 2003, ICML.

[11]  Baruch Awerbuch, et al.  Adaptive routing with end-to-end feedback: distributed learning and geometric approaches, 2004, STOC '04.

[12]  Avrim Blum, et al.  Online Geometric Optimization in the Bandit Setting Against an Adaptive Adversary, 2004, COLT.

[13]  J. M. Bilbao, et al.  Contributions to the Theory of Games, 2005.

[14]  H. Robbins Some aspects of the sequential design of experiments , 1952 .