Stochastic Bandits with Linear Constraints

We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies whose expected cumulative reward over $T$ rounds is maximized, while the expected cost of each policy remains below a given threshold $\tau$. We propose an upper-confidence-bound algorithm for this problem, called optimistic-pessimistic linear bandit (OPLB), and prove an $\widetilde{\mathcal{O}}(\frac{d\sqrt{T}}{\tau-c_0})$ bound on its $T$-round regret, where the denominator is the difference between the constraint threshold and the cost of a known feasible action. We further specialize our results to multi-armed bandits and propose a computationally efficient algorithm for this setting. We prove a regret bound of $\widetilde{\mathcal{O}}(\frac{\sqrt{KT}}{\tau - c_0})$ for this algorithm in $K$-armed bandits, which is a $\sqrt{K}$ improvement over the bound obtained by simply casting multi-armed bandits as an instance of contextual linear bandits and applying the regret bound of OPLB. We also prove a lower bound for the problem studied in the paper and provide simulations to validate our theoretical results.
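
As a rough illustration of the optimistic-pessimistic idea, the sketch below (Python/NumPy) maintains ridge estimates of the reward and cost parameters and, at each round, plays the action with the largest optimistic (UCB) reward among those whose inflated, pessimistic cost estimate stays below $\tau$. This is a minimal sketch under stated assumptions, not the paper's exact construction: the class name, the shared confidence width `beta`, and the finite candidate action set are illustrative, and OPLB additionally relies on mixing with a known feasible action of cost $c_0 < \tau$.

```python
# Minimal sketch of an optimistic-pessimistic selection rule for a
# constrained linear bandit (not the paper's exact OPLB algorithm).
# Assumptions: per-round reward x^T theta*, per-round cost x^T mu*,
# a finite candidate action set, and a hypothetical confidence width `beta`.
import numpy as np

class OptimisticPessimisticLinear:
    def __init__(self, dim, tau, lam=1.0, beta=1.0):
        self.tau, self.beta = tau, beta
        self.V = lam * np.eye(dim)      # regularized Gram matrix
        self.b_r = np.zeros(dim)        # running sum of x * observed reward
        self.b_c = np.zeros(dim)        # running sum of x * observed cost

    def select(self, actions):
        """Return the action with the best optimistic reward among those
        whose pessimistic cost estimate satisfies the constraint."""
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b_r    # ridge estimate of the reward parameter
        mu_hat = V_inv @ self.b_c       # ridge estimate of the cost parameter
        best, best_ucb = None, -np.inf
        for x in actions:
            width = self.beta * np.sqrt(x @ V_inv @ x)   # confidence width
            reward_ucb = x @ theta_hat + width           # optimistic reward
            cost_ucb = x @ mu_hat + width                # pessimistic (inflated) cost
            if cost_ucb <= self.tau and reward_ucb > best_ucb:
                best, best_ucb = x, reward_ucb
        return best                     # None if no action passes the cost check

    def update(self, x, reward, cost):
        """Incorporate the observed reward and cost of the played action."""
        self.V += np.outer(x, x)
        self.b_r += reward * x
        self.b_c += cost * x
```

In a full implementation, the fallback when no candidate passes the pessimistic cost check would be the known feasible action with cost $c_0$, which is what makes the denominator $\tau - c_0$ appear in the regret bound.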
