Online Learning with Switching Costs and Other Adaptive Adversaries

We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player's performance using a new notion of regret, also known as policy regret, which better captures the adversary's adaptiveness to the player's behavior. In a setting where losses are allowed to drift, we characterize, in a nearly complete manner, the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs, the attainable rate with bandit feedback is Θ(T^{2/3}). Interestingly, this rate is significantly worse than the Θ(√T) rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also show that a bounded-memory adversary can force Θ(T^{2/3}) regret even in the full-information case, proving that switching costs are easier to control than bounded-memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies.
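
For concreteness, policy regret charges the comparator with the adversary's reaction to the comparator's own actions, rather than to the player's. A minimal statement of the standard definition from the policy-regret literature, using our own notation f_t(x_1, …, x_t) for a loss that may depend on the entire history of played actions:

```latex
% Policy regret over T rounds: the comparator's losses are regenerated
% by replaying the fixed action x on every round, so the adversary's
% memory-dependent reactions to x are taken into account.
R_T^{\mathrm{policy}}
  = \sum_{t=1}^{T} f_t(x_1, \dots, x_t)
  - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x, \dots, x)
```

Standard (external) regret instead compares against \min_{x} \sum_{t} f_t(x_1, \dots, x_{t-1}, x), which keeps the player's observed history fixed while swapping only the final action; against an adaptive adversary this comparison can become uninformative, which is what motivates the policy-regret notion above.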
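
The Θ(T^{2/3}) bandit upper bound with switching costs can be obtained by a blocking (mini-batching) argument: commit to a single arm for τ consecutive rounds and feed the block-averaged losses to a standard bandit algorithm such as EXP3, so the player switches at most T/τ times. The sketch below is a minimal illustration of that idea, not the paper's exact algorithm; the function name exp3_minibatch and the callback loss_fn(t, arm), returning a loss in [0, 1], are our own assumptions.

```python
import numpy as np

def exp3_minibatch(loss_fn, K, T, tau, seed=None):
    """Run EXP3 over blocks of tau rounds, playing one arm per block.

    loss_fn(t, arm) -> loss in [0, 1]; it may depend on previously
    played arms (an adaptive adversary with switching costs).
    """
    rng = np.random.default_rng(seed)
    num_blocks = T // tau
    eta = np.sqrt(np.log(K) / (num_blocks * K))  # EXP3 learning rate
    weights = np.ones(K)
    total_loss = 0.0
    for b in range(num_blocks):
        probs = weights / weights.sum()
        arm = rng.choice(K, p=probs)
        # Play the same arm for the whole block: at most one switch per block.
        block_loss = sum(loss_fn(t, arm)
                         for t in range(b * tau, (b + 1) * tau))
        total_loss += block_loss
        # Importance-weighted estimate of the block's average loss.
        est = np.zeros(K)
        est[arm] = (block_loss / tau) / probs[arm]
        weights *= np.exp(-eta * est)
        weights /= weights.sum()  # renormalize to avoid underflow
    return total_loss
```

Choosing τ ≈ T^{1/3} balances the roughly √(τ T K log K) regret that EXP3 accumulates over the T/τ blocks against the at most T/τ switching charges, both of order T^{2/3}, which is how the rate in the abstract arises.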
