Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

A multi-armed bandit problem - or, simply, a bandit problem - is a sequential allocation problem defined by a set of actions. At each time step, a unit resource is allocated to an action and some observable payoff is obtained. The goal is to maximize the total payoff obtained in a sequence of allocations. The name bandit refers to the colloquial term for a slot machine (a "one-armed bandit" in American slang). In a casino, a sequential allocation problem arises when a player faces many slot machines at once (a "multi-armed bandit") and must repeatedly choose where to insert the next coin. Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration-exploitation trade-off: the balance between staying with the option that gave the highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the 1930s, exploration-exploitation trade-offs arise in many modern applications, such as ad placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. This monograph focuses on two extreme cases in which the analysis of regret is particularly simple and elegant: independent and identically distributed (i.i.d.) payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, it also analyzes some of the most important variants and extensions, such as the contextual bandit model. The monograph is an ideal reference for students and researchers with an interest in bandit problems.
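As a minimal sketch of the i.i.d. interaction loop described above (not taken from the monograph): at each round one arm is pulled, a random payoff is observed, and performance is measured by the pseudo-regret, the expected payoff lost relative to always pulling the best arm. The policy used here is the classical UCB1 index rule of Auer, Cesa-Bianchi, and Fischer; the Bernoulli arm means and horizons below are illustrative assumptions, not values from the text.

import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms with the given means; return the pseudo-regret."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # number of pulls of each arm
    sums = [0.0] * k          # cumulative observed payoff of each arm
    pseudo_regret = 0.0
    best_mean = max(means)

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1       # pull each arm once to initialise its index
        else:
            # UCB1 index: empirical mean plus an exploration bonus that
            # shrinks as an arm is pulled more often.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        payoff = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli payoff
        counts[arm] += 1
        sums[arm] += payoff
        # Pseudo-regret: expected loss from not playing the best arm this round.
        pseudo_regret += best_mean - means[arm]

    return pseudo_regret

if __name__ == "__main__":
    # Three arms with assumed means 0.3, 0.5, 0.6; in this i.i.d. setting the
    # pseudo-regret of UCB1 grows only logarithmically with the horizon.
    for n in (1000, 10000, 100000):
        print(n, round(ucb1([0.3, 0.5, 0.6], n), 1))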
