Bandits with switching costs: T^{2/3} regret

We study the adversarial multi-armed bandit problem in a setting where the player incurs a unit cost each time he switches actions. We prove that the player's T-round minimax regret in this setting is Θ̃(T^{2/3}), thereby closing a fundamental gap in our understanding of learning with bandit feedback. In the corresponding full-information version of the problem, the minimax regret is known to grow at the much slower rate of Θ(√T). The difference between these two rates provides the first indication that learning with bandit feedback can be significantly harder than learning with full-information feedback (previous results only showed a different dependence on the number of actions, but not on T). In addition to characterizing the inherent difficulty of the multi-armed bandit problem with switching costs, our results also resolve several other open problems in online learning. One direct implication is that learning with bandit feedback against bounded-memory adaptive adversaries has a minimax regret of Θ̃(T^{2/3}). Another implication is that the minimax regret of online learning in adversarial Markov decision processes (MDPs) is Θ̃(T^{2/3}). The key to all of our results is a new randomized construction of a multi-scale random walk, which is of independent interest and likely to prove useful in additional settings.
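
As a rough illustration of the multi-scale random walk mentioned above, the Python sketch below shows one way such a process can be generated: each round t is linked to a parent round ρ(t) = t − 2^{δ(t)}, where δ(t) is the exponent of the largest power of two dividing t, and the value at round t is the parent's value plus fresh Gaussian noise, so every round depends on only O(log T) earlier rounds. The function name, the noise scale sigma, and the exact parent rule are assumptions made for illustration; this is a sketch of the idea, not the paper's precise construction or parameters.

```python
import numpy as np

def multiscale_random_walk(T, sigma=1.0, seed=0):
    """Illustrative sketch of a multi-scale random walk of length T.

    Round t is linked to a parent round rho(t) = t - 2^{delta(t)}, where
    delta(t) is the exponent of the largest power of two dividing t; the
    value at t is the parent's value plus independent Gaussian noise, so
    each round depends on only O(log T) earlier rounds.
    """
    rng = np.random.default_rng(seed)
    W = np.zeros(T + 1)                      # W[0] = 0 serves as the root
    for t in range(1, T + 1):
        delta = (t & -t).bit_length() - 1    # largest i with 2^i dividing t
        parent = t - (1 << delta)            # rho(t) = t - 2^delta
        W[t] = W[parent] + sigma * rng.normal()
    return W[1:]

# Example usage: a short walk with small Gaussian increments.
walk = multiscale_random_walk(T=1024, sigma=0.1)
```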
