Combinatorial Sleeping Bandits With Fairness Constraints

The multi-armed bandit (MAB) model has been widely adopted for studying many practical optimization problems (network resource allocation, ad placement, crowdsourcing, etc.) with unknown parameters. The goal of the player (i.e., the decision maker) is to maximize the cumulative reward in the face of uncertainty. However, the basic MAB model neglects two features common to real-world applications: multiple arms (i.e., actions) can be played simultaneously, and an arm may sometimes be “sleeping” (i.e., unavailable). Besides reward maximization, ensuring fairness is also a key design concern in practice. To that end, we propose a new Combinatorial Sleeping MAB model with Fairness constraints, called CSMAB-F, to address these modeling issues. The objective is now to maximize the reward while satisfying a fairness requirement of a minimum selection fraction for each individual arm. To tackle this new problem, we extend the Upper Confidence Bound (UCB) online learning algorithm to handle the critical tradeoff between exploitation and exploration, and we employ the virtual queue technique to properly handle the fairness constraints. By carefully integrating these two techniques, we develop a new algorithm, called Learning with Fairness Guarantee (LFG), for the CSMAB-F problem. Further, we rigorously prove that LFG is not only feasibility-optimal but also has a time-average regret upper bounded by $\frac{N}{2\eta} + \frac{\beta_1 \sqrt{m N T \log T} + \beta_2 N}{T}$, where $N$ is the total number of arms, $m$ is the maximum number of arms that can be played simultaneously, $T$ is the time horizon, $\beta_1$ and $\beta_2$ are constants, and $\eta$ is a tunable design parameter. Finally, we perform extensive simulations to corroborate the effectiveness of the proposed algorithm. Interestingly, the simulation results reveal an important tradeoff between the regret and the speed of convergence to a point satisfying the fairness constraints.
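To make the recipe in the abstract concrete, the following is a minimal Python sketch of an LFG-style policy, assuming Bernoulli rewards in [0, 1] and a simple cardinality constraint (at most m arms per round). The class name LFGSketch, the UCB1-style index, and the exact "queue plus η-weighted UCB" scoring are illustrative assumptions, not the paper's precise algorithm or constants.

```python
import math
import random

class LFGSketch:
    """Illustrative sketch of an LFG-style policy for CSMAB-F.

    Combines a UCB1-style index (exploitation/exploration) with virtual
    queues (fairness), in the drift-plus-penalty spirit described in the
    abstract. Names and the exact index weighting are assumptions.
    """

    def __init__(self, n_arms, m, min_fractions, eta):
        self.n = n_arms                  # N: total number of arms
        self.m = m                       # m: max arms playable per round
        self.r = min_fractions           # r_i: required selection fraction of arm i
        self.eta = eta                   # eta: design parameter from the regret bound
        self.Q = [0.0] * n_arms          # virtual queue = accumulated fairness debt
        self.counts = [0] * n_arms       # plays of arm i so far
        self.means = [0.0] * n_arms      # empirical mean reward of arm i

    def _ucb(self, i, t):
        # Standard UCB1-style optimistic index; rewards assumed in [0, 1].
        if self.counts[i] == 0:
            return 1.0
        return min(1.0, self.means[i] + math.sqrt(2.0 * math.log(t) / self.counts[i]))

    def select(self, available, t):
        # With only a cardinality constraint |S| <= m, the per-round
        # maximizer of sum_i (Q_i + eta * UCB_i) over the available arms
        # is simply the top-m arms by that score (all scores are >= 0).
        ranked = sorted(available,
                        key=lambda i: self.Q[i] + self.eta * self._ucb(i, t),
                        reverse=True)
        return ranked[:self.m]

    def update(self, played, rewards):
        for i, x in zip(played, rewards):
            self.counts[i] += 1
            self.means[i] += (x - self.means[i]) / self.counts[i]
        played = set(played)
        for i in range(self.n):
            # Virtual queue: arrival r_i every round, unit service when played.
            self.Q[i] = max(self.Q[i] + self.r[i] - (1.0 if i in played else 0.0), 0.0)


if __name__ == "__main__":
    random.seed(0)
    mu = [0.9, 0.8, 0.3, 0.2]            # true Bernoulli means (unknown to the player)
    policy = LFGSketch(n_arms=4, m=2, min_fractions=[0.3] * 4, eta=50.0)
    plays, T = [0] * 4, 20000
    for t in range(1, T + 1):
        available = [i for i in range(4) if random.random() < 0.9]  # arms sleep w.p. 0.1
        chosen = policy.select(available, t)
        policy.update(chosen, [1.0 if random.random() < mu[i] else 0.0 for i in chosen])
        for i in chosen:
            plays[i] += 1
    print("selection fractions:", [round(p / T, 3) for p in plays])  # each should be >= ~0.3
```

The role of η in the stated bound is visible here: a larger η downweights the virtual queues relative to the UCB indices, shrinking the $\frac{N}{2\eta}$ regret term but letting fairness debt accumulate longer before it dominates the selection rule, which is consistent with the tradeoff between regret and the speed of convergence to fairness observed in the paper's simulations.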
