Combinatorial Blocking Bandits with Stochastic Delays

Recent work has considered natural variations of the multi-armed bandit problem, where the reward distribution of each arm depends on the time elapsed since the arm was last played. In this direction, a simple (yet widely applicable) model is that of blocking bandits, where an arm becomes unavailable for a deterministic number of rounds after each play. In this work, we extend the above model in two directions: (i) We consider the general combinatorial setting where more than one arm can be played at each round, subject to feasibility constraints. (ii) We allow the blocking time of each arm to be stochastic. We first study the computational/unconditional hardness of the above setting and identify the conditions under which the problem becomes tractable (even in an approximate sense). Based on these conditions, we provide a tight analysis of the approximation guarantee of a natural greedy heuristic that always plays the feasible subset of maximum expected reward among the available (non-blocked) arms. When the arms' expected rewards are unknown, we adapt this heuristic into a UCB-based bandit algorithm, for which we provide sublinear (approximate) regret guarantees that match the theoretical lower bounds in the limiting case of no delays.
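To make the algorithmic idea concrete, the following is a minimal Python sketch of the greedy-UCB scheme described above. It is illustrative only: it assumes a simple cardinality constraint (play at most k available arms per round) in place of the general feasibility constraints treated in the paper, Bernoulli rewards, and uniformly drawn integer blocking delays; the class name BlockingBanditGreedyUCB and all parameter choices are hypothetical, not from the paper.

```python
import math
import random


class BlockingBanditGreedyUCB:
    """Illustrative greedy-UCB sketch for combinatorial blocking bandits.

    Assumes a cardinality constraint (play at most k of the currently
    available arms per round); the paper allows general feasibility
    constraints, handled via a (possibly approximate) maximization oracle.
    """

    def __init__(self, n_arms, k):
        self.n_arms = n_arms
        self.k = k
        self.counts = [0] * n_arms          # times each arm has been played
        self.means = [0.0] * n_arms         # empirical mean reward per arm
        self.blocked_until = [0] * n_arms   # round at which each arm frees up
        self.t = 0                          # current round

    def _ucb(self, i):
        # Standard UCB1 index; arms never played yet get priority.
        if self.counts[i] == 0:
            return float("inf")
        bonus = math.sqrt(2.0 * math.log(self.t + 1) / self.counts[i])
        return self.means[i] + bonus

    def select(self):
        # Greedy step: among the non-blocked arms, play the feasible subset
        # maximizing the sum of UCB indices (top-k under the cardinality
        # constraint assumed here).
        available = [i for i in range(self.n_arms)
                     if self.blocked_until[i] <= self.t]
        available.sort(key=self._ucb, reverse=True)
        return available[: self.k]

    def update(self, played, rewards, delays):
        # rewards[i] and delays[i] are the observed reward and the realized
        # stochastic blocking delay of each played arm i (semi-bandit feedback).
        for i in played:
            self.counts[i] += 1
            self.means[i] += (rewards[i] - self.means[i]) / self.counts[i]
            self.blocked_until[i] = self.t + delays[i]
        self.t += 1


if __name__ == "__main__":
    # Toy run with Bernoulli rewards and uniform integer delays (illustrative).
    random.seed(0)
    true_means = [0.9, 0.8, 0.5, 0.3, 0.2]
    agent = BlockingBanditGreedyUCB(n_arms=5, k=2)
    for _ in range(1000):
        played = agent.select()
        rewards = {i: float(random.random() < true_means[i]) for i in played}
        delays = {i: random.randint(1, 4) for i in played}
        agent.update(played, rewards, delays)
```

The greedy step mirrors the heuristic from the abstract: each round it ranks the non-blocked arms by their UCB indices and plays the best feasible subset. Under a richer feasibility structure (e.g., a matroid or packing constraint), the sort-and-truncate step would be replaced by a call to an approximate maximization oracle over the available arms.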
