A Hierarchical Load-Balancing Framework for Dynamic Multithreaded Computations

High-level parallel programming models that support dynamic fine-grained threads in a global object space are becoming increasingly popular for expressing irregular applications based on sophisticated adaptive algorithms and pointer-based data structures. However, implementing these multithreaded computations on scalable parallel machines poses significant challenges, particularly with respect to load-balancing. Load-balancing techniques must incur low enough overhead to support fine-grained threads while remaining sophisticated enough to preserve data locality and thread execution priority. This paper presents a hierarchical framework that addresses these conflicting goals by viewing the computation as a collection of thread subsets, each of which is load-balanced independently. In contrast to previous processor-centric approaches, which advocate a single uniform policy for load-balancing all threads in a computation, our framework allows each thread subset to be load-balanced using the policy best suited to its characteristics (e.g., locality or priority sensitivity). The framework consists of two parts: (i) language support that lets a programmer tag different thread subsets with appropriate policies, and (ii) run-time support that synthesizes overall application load balance by composing these individual policies. The framework has been implemented in the Illinois Concert runtime system, an execution platform for fine-grained concurrent object-oriented languages. Results for four large irregular applications on the Cray T3D and the SGI Origin 2000 demonstrate the advantages of the hierarchical framework: performance improves by up to an order of magnitude compared with using a uniform load-balancing policy.
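
The two-part structure described above (tagging thread subsets with policies, then composing those policies at run time) can be illustrated with a minimal single-node sketch. It is not the Concert runtime's interface: the class names `FifoPolicy`, `PriorityPolicy`, and `Scheduler`, the tags `"search"` and `"update"`, and the round-robin composition rule are all assumptions made for illustration, and the sketch does not model thread migration between processors.

```python
from collections import deque
import heapq


class FifoPolicy:
    """Hypothetical policy for a subset with no priority sensitivity."""

    def __init__(self):
        self._queue = deque()

    def push(self, task, priority=0):
        # Priority is ignored: tasks run in arrival order.
        self._queue.append(task)

    def pop(self):
        return self._queue.popleft() if self._queue else None


class PriorityPolicy:
    """Hypothetical policy for a priority-sensitive subset (lowest value first)."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so tasks themselves are never compared

    def push(self, task, priority=0):
        heapq.heappush(self._heap, (priority, self._seq, task))
        self._seq += 1

    def pop(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


class Scheduler:
    """Composes per-subset policies. Round-robin polling is an assumed rule."""

    def __init__(self):
        self._subsets = {}  # tag -> policy, polled in registration order

    def register(self, tag, policy):
        self._subsets[tag] = policy

    def spawn(self, tag, task, priority=0):
        # The tag selects which subset, and therefore which policy, owns the task.
        self._subsets[tag].push(task, priority)

    def run(self):
        results = []
        progress = True
        while progress:
            progress = False
            for policy in self._subsets.values():
                task = policy.pop()
                if task is not None:
                    results.append(task())
                    progress = True
        return results


sched = Scheduler()
sched.register("search", PriorityPolicy())
sched.register("update", FifoPolicy())
sched.spawn("search", lambda: "s-low", priority=5)
sched.spawn("search", lambda: "s-high", priority=1)
sched.spawn("update", lambda: "u1")
sched.spawn("update", lambda: "u2")
order = sched.run()
```

Each subset keeps its own queue discipline, so changing the policy attached to one tag leaves the other subsets untouched; only the composition step in `run` sees all of them.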
