The Symmetry between Arms and Knapsacks: A Primal-Dual Approach for Bandits with Knapsacks

In this paper, we study the bandits with knapsacks (BwK) problem and develop a primal-dual algorithm that achieves a problem-dependent logarithmic regret bound. The BwK problem extends the multi-armed bandit (MAB) problem to model the resource consumption associated with playing each arm, and the existing BwK literature has mainly focused on deriving asymptotically optimal, distribution-free regret bounds. We first study the primal and dual linear programs underlying the BwK problem. From this primal-dual perspective, we discover a symmetry between arms and knapsacks and propose a new sub-optimality measure for the BwK problem. This sub-optimality measure highlights the important role of knapsacks in determining algorithm regret and inspires the design of our two-phase algorithm. In the first phase, the algorithm identifies the optimal arms and the binding knapsacks; in the second phase, it exhausts the binding knapsacks by playing the optimal arms through an adaptive procedure. Our regret upper bound involves the proposed sub-optimality measure and has a logarithmic dependence on the horizon length $T$ and a polynomial dependence on $m$ (the number of arms) and $d$ (the number of knapsacks). To the best of our knowledge, this is the first problem-dependent logarithmic regret bound for the general BwK problem.
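To make the primal-dual perspective concrete, the following is a minimal sketch of the fluid (deterministic) linear program that is standard in the BwK literature, together with its dual. The notation here ($\mu_j$ for the mean reward of arm $j$, $c_{ij}$ for the mean consumption of knapsack resource $i$ by arm $j$, $B_i$ for the budget of resource $i$, and $x_j$ for the expected number of plays of arm $j$) is illustrative and not taken verbatim from the paper.

\begin{align*}
\text{(Primal)}\quad & \max_{x \ge 0} \ \sum_{j=1}^{m} \mu_j x_j
\quad \text{s.t.} \quad \sum_{j=1}^{m} c_{ij} x_j \le B_i \ \ (i=1,\dots,d), \qquad \sum_{j=1}^{m} x_j \le T, \\
\text{(Dual)}\quad & \min_{y \ge 0,\, z \ge 0} \ \sum_{i=1}^{d} B_i y_i + T z
\quad \text{s.t.} \quad \sum_{i=1}^{d} c_{ij} y_i + z \ge \mu_j \ \ (j=1,\dots,m).
\end{align*}

Complementary slackness in this pair is one way to read the arm-knapsack symmetry: arms played a positive expected number of times correspond to tight dual constraints, while binding knapsacks correspond to strictly positive dual prices $y_i$, which is the structure the first phase of the two-phase algorithm aims to identify.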
