Learning Proportionally Fair Allocations with Low Regret

This paper addresses a generic sequential resource allocation problem where, in each round, a decision maker selects an allocation of resources (servers) to a set of tasks, each consisting of a large number of jobs. A job of task $i$ assigned to server $j$ is successfully treated with probability $\theta_{ij}$ in a round, and the decision maker is informed of whether this job is completed at the end of the round. The probabilities $\theta_{ij}$ are initially unknown and have to be learned. The objective of the decision maker is to sequentially assign jobs of the various tasks to servers so as to rapidly learn and converge to the Proportionally Fair (PF) allocation (or other similar allocations achieving an appropriate trade-off between efficiency and fairness). We formulate the problem as a multi-armed bandit (MAB) optimization problem, and devise sequential assignment algorithms with low regret, defined as the difference over a given number of slots between the utility achieved by an oracle algorithm aware of the $\theta_{ij}$'s and that achieved by the proposed algorithm. We first provide the properties of the so-called Restricted-PF (RPF) allocation, obtained by assuming that each task can only use a single server, and in particular show that it is very close to the PF allocation. We devise ES-RPF, an algorithm that learns the RPF allocation with regret no greater than $\mathcal{O}\bigl(\frac{m^3}{\theta_{\min}\Delta_{\min}}\log(T)\bigr)$ after $T$ slots, where $m$, $\theta_{\min}$, and $\Delta_{\min}$ denote the number of tasks, the minimum success rate $\min_{i,j}\theta_{ij}$, and an appropriately defined notion of gap, respectively. We further provide regret lower bounds satisfied by any algorithm targeting the RPF allocation. Finally, we present ES-PF, an algorithm directly learning the PF allocation, and prove that its regret does not exceed $\mathcal{O}\bigl(\frac{m^2 s}{\theta_{\min}}\sqrt{T}\log(T)\bigr)$ after $T$ slots, where $s$ denotes the number of servers.
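
For concreteness, a standard way to write the PF objective (a sketch under the usual rate-based model; the exact constraint set used in the paper may differ) is
\[
x^\star \in \arg\max_{x \ge 0} \; \sum_{i=1}^{m} \log\Bigl(\sum_{j=1}^{s} x_{ij}\,\theta_{ij}\Bigr) \quad \text{subject to} \quad \sum_{i=1}^{m} x_{ij} \le 1 \;\; \text{for all } j,
\]
where $x_{ij}$ denotes the fraction of server $j$'s capacity allocated to task $i$. Writing $U(x)$ for this log-utility, the regret of an algorithm producing allocations $x_1,\dots,x_T$ over $T$ slots is then $R(T) = \sum_{t=1}^{T} \bigl(U(x^\star) - U(x_t)\bigr)$.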
