Scheduling Efficiently for Irregular Load Distributions in a Large-scale Cluster

Random stealing is a well-known dynamic scheduling algorithm. In a large-scale cluster, however, an idle node may have to make many random steal attempts before it obtains a task from another node. This problem severely degrades performance in systems where only a few nodes generate most of the workload. In this paper, we present Transitive Random Stealing (TRS), an efficient dynamic scheduling algorithm based on random stealing that lets any idle node rapidly obtain a task from another node under irregular load distributions in a large-scale cluster. Using the random baseline technique, we experimentally compare TRS with Shis, one of the load-balancing policies in the EARTH system, and with random stealing for different load distributions on the Tsinghua EastSun cluster. The results show that TRS is a highly efficient scheduling algorithm for irregular load distributions in a large-scale cluster. Finally, TRS is implemented in Jcluster, a high-performance Java parallel environment, and an experimental result is given for the HKU Gideon 300 cluster.
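To illustrate why plain random stealing struggles when only a few nodes hold work, the following minimal sketch counts how many uniformly random steal attempts an idle node needs before it hits a loaded node. It is not TRS, whose mechanism is not described in this abstract. The function names, the cluster size of 300 nodes, and the assumption that the set of loaded nodes stays fixed during the search are all hypothetical choices made for illustration.

```python
import random


def steal_attempts(num_nodes, loaded, thief, rng):
    """Count uniform random steal attempts until the thief picks a loaded node."""
    attempts = 0
    while True:
        attempts += 1
        # Pick a victim uniformly among the other num_nodes - 1 nodes.
        victim = rng.randrange(num_nodes - 1)
        if victim >= thief:
            victim += 1  # shift past the thief so it never picks itself
        if victim in loaded:
            return attempts


def mean_attempts(num_nodes, num_loaded, trials=2000, seed=0):
    """Average attempts over many trials (assumes num_loaded < num_nodes)."""
    rng = random.Random(seed)
    loaded = set(range(num_loaded))  # assumption: nodes 0..num_loaded-1 hold all tasks
    thief = num_nodes - 1            # an idle node outside the loaded set
    total = sum(steal_attempts(num_nodes, loaded, thief, rng) for _ in range(trials))
    return total / trials


def expected_attempts(num_nodes, num_loaded):
    """Mean of the geometric distribution with success probability k / (n - 1)."""
    return (num_nodes - 1) / num_loaded


if __name__ == "__main__":
    for k in (150, 30, 3):
        print(k, round(mean_attempts(300, k), 1), round(expected_attempts(300, k), 1))
```

Under these assumptions the expected number of attempts is (n - 1) / k for n nodes of which k are loaded, so it grows in inverse proportion to the fraction of nodes that hold tasks. This is the cost that motivates looking beyond plain random stealing for irregular load distributions.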