Learning and balancing time-varying loads in large-scale systems

Consider a system of n parallel server pools where tasks arrive as a time-varying Poisson process. The system aims to balance the load through two control loops: an inner loop uses an admission threshold to assign incoming tasks to server pools, and an outer loop, a learning scheme, adjusts this threshold over time in steps of ∆ units to keep it aligned with the time-varying overall load. If the fluctuations in the normalized load are smaller than ∆, then we prove that the threshold settles for all large enough n, and that it balances the load when ∆ = 1. Our model captures a tradeoff between optimality and stability: for larger ∆ the degree of balance decreases, but the threshold remains constant under larger load fluctuations. The analysis of this model is mathematically challenging, particularly because the learning scheme relies on subtle variations in the occupancy state of the system that vanish on the fluid scale; the methodology developed in this paper overcomes this hurdle by leveraging the tractability of the specific system dynamics. Strong approximations are used to prove certain dynamical properties, which are then used to characterize the behavior of the system without relying on a traditional fluid-limit analysis.
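The two control loops described above can be illustrated with a minimal sketch. The abstract does not state the exact dispatching or learning rules, so everything below is an assumption made for illustration: the function names `dispatch` and `adjust_threshold`, the secondary level `threshold + delta`, the tuning parameter `alpha`, and the specific conditions for raising or lowering the threshold are all hypothetical and should not be read as the rules analyzed in the paper.

```python
import random


def dispatch(pools, threshold, delta, rng):
    """Inner loop (hypothetical rule): choose a server pool for a new task.

    pools[i] is the number of tasks currently in pool i.
    """
    # Prefer pools that are still below the admission threshold.
    below = [i for i, q in enumerate(pools) if q < threshold]
    if below:
        return rng.choice(below)
    # Assumed fallback: pools below threshold + delta.
    below_upper = [i for i, q in enumerate(pools) if q < threshold + delta]
    if below_upper:
        return rng.choice(below_upper)
    # Otherwise pick any pool uniformly at random.
    return rng.randrange(len(pools))


def adjust_threshold(pools, threshold, delta, alpha=0.9):
    """Outer loop (hypothetical rule): move the threshold in steps of delta.

    alpha is an assumed tuning parameter, not taken from the abstract.
    """
    # Assumed increase rule: every pool has reached threshold + delta.
    if all(q >= threshold + delta for q in pools):
        return threshold + delta
    # Assumed decrease rule: too few pools have reached the threshold.
    fraction_full = sum(q >= threshold for q in pools) / len(pools)
    if fraction_full < alpha and threshold > delta:
        return threshold - delta
    return threshold


if __name__ == "__main__":
    rng = random.Random(0)
    pools = [0] * 10
    threshold, delta = 1, 1
    # Feed 25 arrivals with no departures, only to show the threshold rising.
    for _ in range(25):
        pools[dispatch(pools, threshold, delta, rng)] += 1
        threshold = adjust_threshold(pools, threshold, delta)
    print(pools, threshold)
```

In this sketch a larger `delta` makes the threshold move less often but tolerates a wider spread of occupancies between pools, which mirrors the tradeoff between optimality and stability mentioned in the abstract.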
