Explore/Exploit Schemes for Web Content Optimization

We propose novel multi-armed bandit (explore/exploit) schemes to maximize total clicks on a content module published regularly on Yahoo! Intuitively, one can ``explore'' each candidate item by displaying it to a small fraction of user visits to estimate the item's click-through rate (CTR), and then ``exploit'' high-CTR items in order to maximize clicks. While bandit methods that seek the optimal trade-off between exploration and exploitation have been studied for decades, existing solutions are not satisfactory for web content publishing applications, where a dynamic set of items with short lifetimes, delayed feedback, and non-stationary reward (CTR) distributions is typical. In this paper, we develop a Bayesian solution and extend several existing schemes to our setting. Through an extensive evaluation of nine bandit schemes, we show that our Bayesian solution is uniformly better across several scenarios. We also study the empirical characteristics of our schemes and provide useful insights into the strengths and weaknesses of each. Finally, we validate our results with a ``side-by-side'' comparison of schemes through live experiments conducted on a random sample of real user visits to Yahoo!
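
To make the explore/exploit idea concrete, the sketch below shows a standard Bayesian bandit scheme (Beta-Bernoulli Thompson sampling): each item's CTR gets a Beta posterior, items are served by sampling from those posteriors (exploration), and observed clicks sharpen the posteriors so high-CTR items are served more often (exploitation). This is an illustrative sketch only, not the paper's exact algorithm; the item names and simulated click feedback are hypothetical, and the paper's solution additionally addresses short item lifetimes, delayed feedback, and non-stationary CTRs.

```python
import random


class ThompsonSamplingBandit:
    """Beta-Bernoulli Thompson sampling: a simple Bayesian explore/exploit scheme.

    Illustrative sketch only; it does not model item lifetimes, delayed
    feedback, or non-stationary CTRs as the paper's scheme does.
    """

    def __init__(self, items):
        # Beta(1, 1) (uniform) prior on each item's click-through rate.
        self.alpha = {item: 1.0 for item in items}  # 1 + observed clicks
        self.beta = {item: 1.0 for item in items}   # 1 + observed non-clicks

    def select(self):
        # Sample a CTR from each item's posterior and serve the best draw.
        draws = {item: random.betavariate(self.alpha[item], self.beta[item])
                 for item in self.alpha}
        return max(draws, key=draws.get)

    def update(self, item, clicked):
        # Posterior update after one user visit.
        if clicked:
            self.alpha[item] += 1.0
        else:
            self.beta[item] += 1.0


if __name__ == "__main__":
    # Hypothetical items with unknown true CTRs, used only to simulate feedback.
    true_ctr = {"story_a": 0.03, "story_b": 0.05, "story_c": 0.02}
    bandit = ThompsonSamplingBandit(true_ctr.keys())
    total_clicks = 0
    for _ in range(10000):
        item = bandit.select()
        clicked = random.random() < true_ctr[item]
        bandit.update(item, clicked)
        total_clicks += clicked
    print("total clicks:", total_clicks)
```

In this setup the bandit naturally shifts traffic toward ``story_b'' as evidence accumulates, while still occasionally exploring the other items.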
