A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit

Adaptive and sequential experiment design is a well-studied problem in numerous domains. We survey and synthesize work on the online statistical learning paradigm known as the multi-armed bandit, integrating the existing research as a resource for a particular class of online experiments. We first present the traditional stochastic model of the multi-armed bandit and then develop a taxonomy of complications to that model, relating each complication to a specific requirement or consideration of the experiment design context. Finally, we present a table of known upper bounds on regret for all studied algorithms, providing both a perspective for future theoretical work and a decision-making aid for practitioners seeking theoretical guarantees.
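
To make the setting concrete, the sketch below simulates the stochastic Bernoulli bandit and the cumulative (pseudo-)regret that the surveyed upper bounds refer to, using the classical UCB1 index policy as the allocation rule. The three-variant scenario, its conversion rates, and the horizon are illustrative assumptions chosen for this example, not values taken from the survey.

```python
import math
import random


def ucb1_regret(true_means, horizon, seed=0):
    """Simulate a stochastic K-armed Bernoulli bandit under UCB1 and return
    the cumulative pseudo-regret: the summed gap between the best arm's mean
    and the mean of the arm actually played in each round."""
    rng = random.Random(seed)
    k = len(true_means)
    best = max(true_means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # summed observed rewards per arm
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # play each arm once to initialise its estimate
        else:
            # UCB1 index: empirical mean plus an exploration bonus
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - true_means[arm]

    return regret


if __name__ == "__main__":
    # Hypothetical three-variant online experiment with conversion rates 5%, 4%, 3%.
    print(ucb1_regret([0.05, 0.04, 0.03], horizon=10_000))
```

Under this model, the logarithmic growth of the accumulated regret as the horizon increases is exactly what finite-time bounds of the UCB1 type guarantee in expectation.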
