A message passing benchmark for unbalanced applications

Abstract

We present a distributed-memory parallel implementation of the unbalanced tree search (UTS) benchmark using MPI and investigate MPI’s ability to efficiently support irregular and nested parallelism through continuous dynamic load balancing. Two load balancing methods are explored: work sharing using a centralized work server, and distributed work stealing using explicit polling to service steal requests. Experiments indicate that, in addition to a parameter defining the granularity of load balancing, message-passing paradigms require additional techniques to manage the volume of communication and mitigate runtime overhead. Using additional parameters, we observed an improvement of up to 3–4X in parallel performance. We report results for three distributed-memory parallel computer systems and use UTS to characterize the performance and scalability of these systems. Overall, we find that the simpler work sharing approach with a single work server achieves good performance on hundreds of processors, and that our distributed work stealing implementation scales to thousands of processors and delivers more robust performance that is less sensitive to the particular workload and load balancing parameters.
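The abstract's two key ingredients — an unbalanced tree whose shape is unknown in advance, and work stealing in which busy workers only service steal requests at explicit polling points — can be illustrated with a toy, single-process simulation. This is not the paper's MPI implementation: the tree parameters (`B0`, `Q`, `M`, in the style of UTS's binomial trees), the steal granularity `chunk`, and the `poll_interval` knob are hypothetical stand-ins for the load-balancing parameters the paper discusses, and the "workers" are stepped round-robin in one process rather than communicating over MPI.

```python
import random
from collections import deque

# Hypothetical UTS-style binomial tree (parameters are illustrative, not
# the paper's): the root has B0 children; every other node has M children
# with probability Q. Q*M < 1 keeps the tree finite but highly unbalanced.
B0, Q, M = 200, 0.15, 5

def children(seed, is_root=False):
    """Deterministically derive a node's children from its seed."""
    rng = random.Random(seed)
    n = B0 if is_root else (M if rng.random() < Q else 0)
    return [rng.getrandbits(32) for _ in range(n)]

def sequential_count(root_seed):
    """Baseline: count tree nodes with an ordinary depth-first traversal."""
    count, stack = 1, deque(children(root_seed, is_root=True))
    while stack:
        count += 1
        stack.extend(children(stack.pop()))
    return count

def stealing_count(root_seed, nworkers=4, chunk=4, poll_interval=8):
    """Simulate work stealing with explicit polling: an idle worker posts a
    steal request; a busy worker checks for pending requests only once
    every poll_interval node expansions, and if it has more than chunk
    nodes it gives half of the oldest ones to the requester."""
    stacks = [deque() for _ in range(nworkers)]
    stacks[0].extend(children(root_seed, is_root=True))
    steal_requests = deque()            # ids of idle workers awaiting work
    waiting = [False] * nworkers
    since_poll = [0] * nworkers
    count = 1                           # the root itself
    while any(stacks):
        for w in range(nworkers):
            stack = stacks[w]
            if not stack:
                if not waiting[w]:      # post one steal request and wait
                    waiting[w] = True
                    steal_requests.append(w)
                continue
            count += 1                  # expand one node (depth-first)
            stack.extend(children(stack.pop()))
            since_poll[w] += 1
            if since_poll[w] >= poll_interval:   # explicit polling point
                since_poll[w] = 0
                if steal_requests and len(stack) > chunk:
                    thief = steal_requests.popleft()
                    share = [stack.popleft() for _ in range(len(stack) // 2)]
                    stacks[thief].extend(share)  # steal the oldest nodes
                    waiting[thief] = False
    return count
```

Any polling interval yields the same node count as the sequential traversal, since polling affects only when work moves, not what is expanded; in the paper's MPI setting the interval instead trades steal-request latency against polling overhead.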
