Low regret bounds for Bandits with Knapsacks

Achievable regret bounds for Multi-Armed Bandit problems are now well documented. They can be classified into two categories based on their dependence on the time horizon $T$: (1) small, distribution-dependent bounds of order $\ln(T)$ and (2) robust, distribution-free bounds of order $\sqrt{T}$. The Bandits with Knapsacks theory, an extension of the framework that models resource consumption, lacks this duality. While several algorithms have been shown to achieve asymptotically optimal distribution-free regret bounds, little progress has been made toward small distribution-dependent regret bounds. We partially bridge this gap by designing a general-purpose algorithm that we show enjoys asymptotically optimal regret bounds in several cases encompassing many practical applications, including dynamic pricing with limited supply and online bidding in ad auctions.
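
For concreteness, the two regimes can be summarized with the usual Multi-Armed Bandit notation (a minimal sketch; the symbols $\mu_a$, $\Delta_a$, and $a_t$ are standard but are not introduced in the abstract above). Writing $\mu_a$ for the mean reward of arm $a$, $a^* \in \arg\max_a \mu_a$, $\Delta_a = \mu_{a^*} - \mu_a$ for the suboptimality gap, and $a_t$ for the arm pulled at round $t$, the expected regret after $T$ rounds is
$$R_T \;=\; T\,\mu_{a^*} \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{a_t}\Big].$$
Distribution-dependent bounds take the form $R_T = O\big(\sum_{a : \Delta_a > 0} \ln(T)/\Delta_a\big)$, which is small when the gaps are large but degrades as they shrink, whereas distribution-free bounds hold uniformly over problem instances and scale as $R_T = O\big(\sqrt{K T}\big)$ up to logarithmic factors for $K$ arms. In the Bandits with Knapsacks extension, regret is instead measured against the expected reward of an optimal policy that respects the resource budgets, which is what makes carrying over the $\ln(T)$-type bounds nontrivial.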
