Considerations on Distributed Load Balancing for Fully Heterogeneous Machines: Two Particular Cases

As parallel systems grow, centralized algorithms for scheduling tasks can induce significant overhead, which is why decentralized scheduling algorithms have been developed. The most popular of these is certainly work stealing, owing to its theoretical guarantees. Meanwhile, parallel systems have evolved from homogeneous clusters to fully heterogeneous ones such as GPU-accelerated clusters. In this paper we investigate decentralized scheduling algorithms for heterogeneous systems. The guarantees of work stealing no longer hold on such systems, because it is an a posteriori algorithm that depends heavily on the initial distribution of work. We therefore focus on a priori decentralized scheduling algorithms and propose two distributed algorithms that balance the load on unrelated machines in two particular cases. The first exploits low heterogeneity in the task set and reaches an approximation ratio linear in the number of task types. The second addresses systems with only two types of machines; we show that it is a 2-approximation if the system converges. When the system does not converge, we study its dynamic equilibrium. In the homogeneous case, we numerically compute the probability density function of the load imbalance and show that the imbalance is low on average. Simulations show that the heterogeneous case behaves similarly to the homogeneous case and that the imbalance is low in both.
