Contextual Bandits under Delayed Feedback

Stochastic linear bandits are a natural and well-studied model for structured exploration-exploitation problems, widely used in applications such as online marketing and recommendation. A major obstacle for practitioners applying existing algorithms is that feedback is usually randomly delayed, and the delays themselves are only partially observable. For example, while a purchase is typically observed some time after the display, the decision not to buy is never explicitly reported to the system: the learner only ever observes delayed positive events. We formalize this problem as a novel stochastic delayed linear bandit and propose ${\tt OTFLinUCB}$ and ${\tt OTFLinTS}$, two computationally efficient algorithms that integrate new information as it becomes available and handle permanently censored feedback. We prove an optimal $\tilde O(d\sqrt{T})$ bound on the regret of the first algorithm and study its dependence on delay-related parameters. Experiments on simulated and real data validate our model, assumptions, and results.
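
The abstract does not spell out the update rules of ${\tt OTFLinUCB}$, but the mechanics it describes (folding observations in on the fly while non-conversions stay permanently censored) can be illustrated with a minimal LinUCB-style sketch. Everything below (the class name `DelayedLinUCB`, the selection rule, the conversion callback) is an illustrative assumption, not the paper's algorithm.

```python
import numpy as np

class DelayedLinUCB:
    """Hypothetical sketch of a LinUCB-style learner that only ever
    receives delayed *positive* feedback (conversions); the absence of
    a conversion is never explicitly reported. This is not the paper's
    OTFLinUCB update, just an illustration of the mechanics."""

    def __init__(self, d, lam=1.0, alpha=1.0):
        self.V = lam * np.eye(d)   # regularized design matrix
        self.b = np.zeros(d)       # reward-weighted feature sum
        self.alpha = alpha         # exploration coefficient
        self.pending = {}          # round -> played context, feedback pending

    def select(self, t, contexts):
        """Pick the context with the highest upper confidence bound."""
        theta_hat = np.linalg.solve(self.V, self.b)  # ridge estimate
        V_inv = np.linalg.inv(self.V)
        ucb = [x @ theta_hat + self.alpha * np.sqrt(x @ V_inv @ x)
               for x in contexts]
        x = contexts[int(np.argmax(ucb))]
        self.V += np.outer(x, x)   # the context is observed immediately
        self.pending[t] = x        # the reward may arrive later, or never
        return x

    def observe_conversion(self, t):
        """A positive event (reward = 1) for round t finally arrives."""
        x = self.pending.pop(t, None)
        if x is not None:
            self.b += x            # fold the delayed reward in on the fly

# Tiny usage example on synthetic data (hypothetical setting):
rng = np.random.default_rng(0)
agent = DelayedLinUCB(d=5)
for t in range(100):
    contexts = [rng.normal(size=5) for _ in range(10)]
    agent.select(t, contexts)
    if t >= 3 and rng.random() < 0.3:   # a conversion from ~3 rounds ago
        agent.observe_conversion(t - 3)
```

Note the asymmetry the abstract highlights: the design matrix can be updated at selection time because contexts are observed immediately, whereas the reward statistic is only updated when a (possibly much later) conversion arrives. A rigorous algorithm would additionally have to correct the bias introduced by conversions that never arrive, e.g. using delay-dependent parameters, which this sketch does not attempt.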
