Sequential Monte Carlo Bandits

In this paper we propose a flexible and efficient framework for multi-armed bandits, combining sequential Monte Carlo algorithms with hierarchical Bayesian modeling techniques. The framework naturally encompasses restless bandits, contextual bandits, and other bandit variants under a single inferential model. Despite the model's generality, we develop efficient Monte Carlo algorithms that make inference scalable, building on recent developments in sequential Monte Carlo methods. Through two simulation studies, the framework is shown to outperform existing empirical methods while also scaling naturally to more complex problems with which existing approaches cannot cope. Additionally, we successfully apply our framework to online video-based advertising recommendation and demonstrate its improved efficacy compared with current state-of-the-art bandit algorithms.
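
The abstract does not spell out the algorithmic details here, so the following is only a minimal illustrative sketch of the general idea it names: combining sequential Monte Carlo inference with posterior-sampling (Thompson-style) action selection, shown for a restless Bernoulli bandit. The random-walk drift model, particle count, and multinomial resampling scheme are assumptions made for illustration, not the paper's exact algorithm.

```python
# Minimal sketch (illustrative assumptions, not the paper's method):
# particle-filter Thompson sampling for a restless Bernoulli bandit.
# Each arm's success probability drifts over time; a small particle
# cloud per arm tracks its posterior, and the action at each step is
# chosen by sampling one particle per arm and playing the argmax.
import numpy as np

rng = np.random.default_rng(0)

N_ARMS, N_PARTICLES, HORIZON = 3, 200, 1000
DRIFT = 0.02  # assumed random-walk scale for each arm's latent rate

# Latent arm probabilities evolve as a clipped Gaussian random walk.
true_p = rng.uniform(0.2, 0.8, size=N_ARMS)

# One particle cloud per arm; each particle is a hypothesis about p_k.
particles = rng.uniform(0.01, 0.99, size=(N_ARMS, N_PARTICLES))

total_reward = 0.0
for t in range(HORIZON):
    # Thompson step: draw one particle per arm, play the best draw.
    draws = particles[np.arange(N_ARMS),
                      rng.integers(N_PARTICLES, size=N_ARMS)]
    arm = int(np.argmax(draws))

    # Environment: Bernoulli reward, then the latent rates drift.
    reward = float(rng.random() < true_p[arm])
    total_reward += reward
    true_p = np.clip(true_p + DRIFT * rng.standard_normal(N_ARMS),
                     0.01, 0.99)

    # SMC update for the played arm: weight particles by the Bernoulli
    # likelihood, resample, then jitter to match the assumed drift.
    w = particles[arm] if reward else 1.0 - particles[arm]
    w = w / w.sum()
    idx = rng.choice(N_PARTICLES, size=N_PARTICLES, p=w)
    particles[arm] = np.clip(
        particles[arm][idx] + DRIFT * rng.standard_normal(N_PARTICLES),
        0.01, 0.99)

    # Unplayed arms: propagate particles through the drift model only,
    # so their posteriors widen while no data arrives (restless case).
    for k in range(N_ARMS):
        if k != arm:
            particles[k] = np.clip(
                particles[k] + DRIFT * rng.standard_normal(N_PARTICLES),
                0.01, 0.99)

print(f"average reward: {total_reward / HORIZON:.3f}")
```

Note the design choice this illustrates: unplayed arms are still propagated through the transition model, so their posterior uncertainty grows between plays, which is what drives re-exploration in restless settings.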
