Nonstochastic Bandits with Composite Anonymous Feedback

We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over at most d consecutive steps in an adversarial way. This implies that the instantaneous loss observed by the player at the end of each round is a sum of as many as d loss components of previously played actions. Hence, unlike the standard bandit setting with delayed feedback, here the player cannot observe the individual delayed losses, but only their sum. Our main contribution is a general reduction transforming a standard bandit algorithm into one that can operate in this harder setting. We also show how the regret of the transformed algorithm can be bounded in terms of the regret of the original algorithm. Our reduction cannot be improved in general: we prove a lower bound on the regret of any bandit algorithm in this setting that matches (up to log factors) the upper bound obtained via our reduction. Finally, we show how our reduction can be extended to more complex bandit settings, such as combinatorial linear bandits and online bandit convex optimization.
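To make the feedback model concrete, below is a minimal simulation sketch of composite anonymous feedback together with a blocked wrapper in the spirit of the reduction described above. The specifics here are illustrative assumptions, not the paper's exact construction: losses are drawn stochastically (the paper's are adversarial), each round's loss is spread over the d components by a random Dirichlet draw, the block length is 2d, and the base algorithm is Exp3 updated with the clipped average composite loss over the second half of each block (by which point every component of the observed loss belongs to the block's own action).

```python
# Sketch of composite anonymous feedback + a blocked bandit wrapper.
# Illustrative assumptions (not from the paper): stochastic losses,
# Dirichlet loss splits, block length 2*d, Exp3 as the base algorithm.
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 5, 4, 20_000                    # arms, max loss spread, horizon
means = rng.uniform(0.2, 0.8, K)          # per-arm expected instantaneous loss
block = 2 * d
eta = np.sqrt(2 * np.log(K) / ((T / block) * K))  # Exp3 rate for T/block plays

w = np.ones(K)                            # Exp3 weights over arms
pending = np.zeros(d)                     # loss mass already scheduled for
                                          # the next d rounds
total = 0.0

for t in range(0, T, block):
    p = w / w.sum()
    arm = rng.choice(K, p=p)              # play this arm for the whole block
    second_half = []
    for s in range(block):
        inst = float(rng.random() < means[arm])      # this round's total loss
        pending += rng.dirichlet(np.ones(d)) * inst  # spread it over d rounds
        observed = pending[0]             # composite loss seen this round
        pending = np.append(pending[1:], 0.0)
        total += observed
        if s >= d:                        # after d rounds, every component of
            second_half.append(observed)  # the observed loss belongs to `arm`
    loss = min(1.0, np.mean(second_half)) # clipping: a simulation convenience
    w[arm] *= np.exp(-eta * loss / p[arm])  # importance-weighted Exp3 update

print(f"avg composite loss {total / T:.3f} vs best arm {means.min():.3f}")
```

Heuristically, this blocking argument shows how a base regret bound transfers to the composite setting: with blocks of length 2d the base algorithm makes T/(2d) plays on rescaled losses in [0,1], and each unit of its regret costs at most 2d rounds, so a base bound of order sqrt(nK log K) over n plays becomes roughly 2d * sqrt((T/2d) K log K) = sqrt(2dKT log K), consistent with the abstract's claim that the upper bound matches the lower bound up to log factors.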
