Thompson sampling with the online bootstrap

Thompson sampling provides a solution to bandit problems in which new observations are allocated to each arm with the posterior probability that that arm is optimal. While sometimes easy to implement and asymptotically optimal, Thompson sampling can be computationally demanding in large-scale bandit problems, and its performance depends on the model fit to the observed data. We introduce bootstrap Thompson sampling (BTS), a heuristic method for solving bandit problems that modifies Thompson sampling by replacing the posterior distribution with a bootstrap distribution. We first describe BTS and show that its performance is competitive with that of Thompson sampling in the well-studied Bernoulli bandit case. We then detail why BTS using the online bootstrap is more scalable than regular Thompson sampling, and we show through simulation that BTS is more robust to a misspecified error distribution. BTS is an appealing modification of Thompson sampling, especially when samples from the posterior are otherwise unavailable or costly to obtain.
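To make the idea concrete, here is a minimal sketch of BTS for the Bernoulli bandit, assuming an online (Poisson) bootstrap in the style of online bagging: each arm maintains J bootstrap replicates of weighted success/trial counts, every new observation is reweighted in each replicate by an independent Poisson(1) draw, and at decision time one replicate per arm is sampled uniformly in place of a posterior draw. The class name, replicate count, and the uniform fallback for replicates with no data yet are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


class BTSBernoulli:
    """Bootstrap Thompson sampling for a Bernoulli bandit (illustrative).

    Each arm keeps J online-bootstrap replicates of weighted
    (successes, trials) counts; a uniformly drawn replicate stands in
    for a posterior draw when selecting an arm.
    """

    def __init__(self, n_arms, n_replicates=100):
        self.successes = np.zeros((n_arms, n_replicates))
        self.trials = np.zeros((n_arms, n_replicates))

    def select_arm(self):
        # Sample one bootstrap replicate per arm; replicates with no
        # weighted data yet fall back to a uniform random draw so that
        # every arm gets explored early on.
        n_arms, n_reps = self.trials.shape
        js = rng.integers(n_reps, size=n_arms)
        trials = self.trials[np.arange(n_arms), js]
        succ = self.successes[np.arange(n_arms), js]
        means = np.where(trials > 0,
                         succ / np.maximum(trials, 1.0),
                         rng.random(n_arms))
        return int(np.argmax(means))

    def update(self, arm, reward):
        # Online bootstrap: weight the new observation in each replicate
        # by an independent Poisson(1) draw, which in expectation mimics
        # resampling the observed data with replacement.
        w = rng.poisson(1.0, size=self.trials.shape[1])
        self.trials[arm] += w
        self.successes[arm] += w * reward


# Usage: a two-armed bandit with true success probabilities 0.5 and 0.6.
bandit = BTSBernoulli(n_arms=2)
true_p = [0.5, 0.6]
for _ in range(1000):
    a = bandit.select_arm()
    r = float(rng.random() < true_p[a])
    bandit.update(a, r)
```

Each update touches only the J replicate counts of the played arm, so the per-observation cost is constant in the number of past observations, and the replicates are mutually independent and can be updated in parallel; this is the source of the scalability advantage over drawing from an exact posterior.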
