An Efficient Pessimistic-Optimistic Algorithm for Constrained Linear Bandits

This paper considers stochastic linear bandits with general constraints. The objective is to maximize the expected cumulative reward over a horizon $T$ subject to a set of constraints in each round $\tau \leq T$. We propose a pessimistic-optimistic algorithm for this problem that is efficient in two respects. First, the algorithm achieves $\tilde{\cal O}\left(\left(\frac{K^{1.5}}{\delta^2}+d\right)\sqrt{\tau}\right)$ (pseudo) regret in round $\tau \leq T$, where $K$ is the number of constraints, $d$ is the dimension of the reward feature space, and $\delta$ is Slater's constant; and it incurs zero constraint violation in every round $\tau > \tau'$, where $\tau'$ is independent of the horizon $T$. Second, the algorithm is computationally efficient. It is based on the primal-dual approach in optimization and consists of two components. The primal component is similar to an algorithm for unconstrained stochastic linear bandits (ours uses the linear upper confidence bound algorithm, LinUCB). The computational complexity of the dual component depends only on the number of constraints and is independent of the sizes of the contextual space, the action space, and even the feature space. Hence the overall computational complexity of our algorithm is comparable to that of LinUCB for unconstrained stochastic linear bandits.
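To make the primal-dual structure concrete, the following is a minimal sketch in Python, assuming a hypothetical `PrimalDualLinUCB` class with illustrative parameters `alpha` (confidence width) and `eta` (dual weight); the exact pessimistic penalty and tuning in the paper may differ. The primal side computes standard LinUCB indices, while the dual side keeps one virtual-queue value per constraint, so the per-round dual update touches only $K$ numbers.

```python
import numpy as np

# Hedged sketch of the pessimistic-optimistic primal-dual structure described
# above: LinUCB on the primal side, one virtual queue per constraint on the
# dual side. Names and the exact penalty form are illustrative assumptions,
# not the paper's precise algorithm.

class PrimalDualLinUCB:
    def __init__(self, d, K, alpha=1.0, eta=1.0, lam=1.0):
        self.alpha = alpha          # UCB exploration width (assumed tuning)
        self.eta = eta              # dual penalty weight (assumed tuning)
        self.A = lam * np.eye(d)    # regularized Gram matrix for LinUCB
        self.b = np.zeros(d)        # running sum of reward-weighted features
        self.Q = np.zeros(K)        # virtual queues, one per constraint

    def select(self, features, cost_estimates):
        """features: (n_actions, d); cost_estimates: (n_actions, K)."""
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b
        # Optimistic reward estimate (LinUCB upper confidence bound).
        ucb = features @ theta_hat + self.alpha * np.sqrt(
            np.einsum('ij,jk,ik->i', features, A_inv, features))
        # Pessimistic penalty: virtual queues weight the estimated costs;
        # this O(K) step is the entire dual computation.
        penalty = cost_estimates @ self.Q
        return int(np.argmax(ucb - self.eta * penalty))

    def update(self, x, reward, costs):
        """x: chosen feature (d,); costs: observed constraint costs (K,)."""
        self.A += np.outer(x, x)
        self.b += reward * x
        # Queue update: accumulate violation, stay nonnegative.
        self.Q = np.maximum(self.Q + costs, 0.0)


if __name__ == "__main__":
    # Toy interaction loop: d=3 features, K=2 constraints, 5 actions/round.
    rng = np.random.default_rng(0)
    agent = PrimalDualLinUCB(d=3, K=2)
    for t in range(100):
        feats = rng.normal(size=(5, 3))
        cost_est = rng.uniform(size=(5, 2))
        a = agent.select(feats, cost_est)
        agent.update(feats[a], reward=rng.normal(), costs=cost_est[a] - 0.5)
```

The sketch mirrors the abstract's computational claim: the reward side is plain LinUCB, and the dual side's per-round work is a single $O(K)$ queue update, independent of the context, action, and feature space sizes.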
