Bandits with Delayed Anonymous Feedback

We study bandits with delayed anonymous feedback, a variant of the stochastic K-armed bandit problem in which the reward from each play of an arm is not obtained instantaneously but is received after some stochastic delay. Furthermore, the learner is told neither which arm an observation corresponds to nor the delay associated with a play. Instead, at each time step, the learner selects an arm to play and receives a reward that could come from any combination of past plays. The problem arises naturally, but the delay and anonymity of the observations make it considerably harder than the standard bandit problem. Despite this, we demonstrate that logarithmic regret is still achievable, at the cost of additional lower-order terms. In particular, we provide an algorithm with regret O(log(T) + √(g(τ) log(T)) + g(τ)), where g(τ) is some function of the delay distribution. This is of the same order as the regret achieved in [9] for the simpler problem in which the observations are not anonymous. Experiments support the theoretical finding that the two regret orders match.
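
To make the feedback model concrete, the following is a minimal simulation sketch of the environment only, not of the algorithm proposed in the paper. The class name `DelayedAnonymousBandit`, the Bernoulli reward model, and the uniform delay on {0, ..., max_delay} are illustrative assumptions that do not come from the paper. At each step the learner sees only the sum of whatever rewards happen to arrive, with no arm labels and no delays attached.

```python
import random
from collections import defaultdict


class DelayedAnonymousBandit:
    """Toy environment with delayed, anonymous, aggregated feedback.

    Assumptions (illustrative only): Bernoulli rewards with the given
    means, and delays drawn uniformly from {0, ..., max_delay}.
    """

    def __init__(self, means, max_delay, seed=0):
        self.means = list(means)
        self.max_delay = max_delay
        self.rng = random.Random(seed)
        self.t = 0
        # Maps an arrival time to the total reward arriving at that time.
        # Arm identities and delays are deliberately not stored.
        self.pending = defaultdict(float)

    def step(self, arm):
        """Play `arm`; return the anonymous sum of rewards arriving now."""
        reward = 1.0 if self.rng.random() < self.means[arm] else 0.0
        delay = self.rng.randint(0, self.max_delay)
        self.pending[self.t + delay] += reward
        observed = self.pending.pop(self.t, 0.0)
        self.t += 1
        return observed

    def outstanding(self):
        """Total reward generated but not yet delivered to the learner."""
        return sum(self.pending.values())


# Usage: alternate between two arms and print what the learner sees.
env = DelayedAnonymousBandit(means=[0.9, 0.2], max_delay=3, seed=42)
for t in range(8):
    arm = t % 2
    print(f"t={t} played arm {arm}, observed {env.step(arm)}")
```

Because an observation may mix rewards from several earlier plays of different arms, a learner cannot simply maintain per-arm sample means as in the standard setting, which is what makes the problem harder.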

[1]  W. R. Thompson On the Likelihood That One Unknown Probability Exceeds Another in View of the Evidence of Two Samples , 1933 .

[2]  T. W. Anderson Sequential Analysis with Delayed Observations , 1964 .

[3]  Yukio Suzuki On sequential decision problems with delayed observations , 1967 .

[4]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[5]  Peter Auer,et al.  UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem , 2010, Period. Math. Hung.

[6]  Lihong Li,et al.  An Empirical Evaluation of Thompson Sampling , 2011, NIPS.

[7]  John Langford,et al.  Efficient Optimal Learning for Contextual Bandits , 2011, UAI.

[8]  Andreas Krause,et al.  Parallelizing Exploration-Exploitation Tradeoffs with Gaussian Process Bandit Optimization , 2012, ICML.

[9]  András György,et al.  Online Learning under Delayed Feedback , 2013, ICML.

[10]  Gábor Lugosi,et al.  Concentration Inequalities - A Nonasymptotic Theory of Independence , 2013, Concentration Inequalities.

[11]  Csaba Szepesvári,et al.  Online Markov Decision Processes Under Bandit Feedback , 2010, IEEE Transactions on Automatic Control.

[12]  Zoran Popovic,et al.  The Queue Method: Handling Delay, Heuristics, Prior Data, and Evaluation in Bandits , 2015, AAAI.

[13]  Tor Lattimore,et al.  On Explore-Then-Commit strategies , 2016, NIPS.

[14]  T. L. Lai and Herbert Robbins Asymptotically Efficient Adaptive Allocation Rules , 1985 .