Logarithmic regret bounds for Bandits with Knapsacks

Optimal regret bounds for Multi-Armed Bandit problems are now well documented. They fall into two categories according to their growth rate in the time horizon $T$: (i) small, distribution-dependent bounds of order $\ln(T)$, and (ii) robust, distribution-free bounds of order $\sqrt{T}$. The Bandits with Knapsacks model, an extension of the framework that captures resource consumption, lacks this clear-cut distinction. While several algorithms have been shown to achieve asymptotically optimal distribution-free regret bounds, little progress has been made toward small distribution-dependent regret bounds. We partially bridge this gap by designing a general-purpose algorithm whose distribution-dependent regret bounds are logarithmic in the initial resource endowments in several important cases covering many practical applications, including dynamic pricing with limited supply, bid optimization in online advertising auctions, and dynamic procurement.
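To make the setting concrete, here is a minimal simulation sketch of a single-resource Bandits with Knapsacks instance, assuming Bernoulli rewards and costs with hypothetical parameters. The ratio-based optimistic index used below is a standard UCB-style baseline for budgeted bandits, not the algorithm analyzed in this paper.

```python
import numpy as np

# Minimal Bandits-with-Knapsacks simulation: each pull of an arm yields a
# Bernoulli reward and consumes a Bernoulli amount of a single resource.
# All parameters below are hypothetical, for illustration only.

rng = np.random.default_rng(0)

K = 3                                   # number of arms
B = 10_000                              # initial resource endowment
T = 50_000                              # time horizon
reward_p = np.array([0.5, 0.6, 0.4])    # hypothetical mean rewards
cost_p = np.array([0.4, 0.7, 0.3])      # hypothetical mean costs

pulls = np.zeros(K)
reward_sum = np.zeros(K)
cost_sum = np.zeros(K)
budget, total_reward = float(B), 0.0

for t in range(1, T + 1):
    if budget < 1.0:    # max per-round cost is 1, so stop before overdrawing
        break
    if t <= K:          # pull each arm once to initialize the estimates
        a = t - 1
    else:
        # Optimistic reward-per-unit-cost index: upper confidence bound on
        # the mean reward divided by lower confidence bound on the mean cost.
        conf = np.sqrt(2.0 * np.log(t) / pulls)
        ucb_r = np.minimum(reward_sum / pulls + conf, 1.0)
        lcb_c = np.maximum(cost_sum / pulls - conf, 1e-6)
        a = int(np.argmax(ucb_r / lcb_c))
    r = rng.binomial(1, reward_p[a])
    c = rng.binomial(1, cost_p[a])
    pulls[a] += 1
    reward_sum[a] += r
    cost_sum[a] += c
    budget -= c
    total_reward += r

print(f"total reward: {total_reward:.0f}, budget left: {budget:.0f}")
```

Stopping as soon as the remaining budget falls below the maximum per-round cost guarantees the resource constraint is never violated; the distribution-dependent regime studied here concerns how the reward such a policy forgoes relative to the optimal one scales with the initial endowment $B$.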
