Adaptive Two-level Thread Management for Fast MPI Execution on Shared Memory Machines

This paper addresses the performance portability of MPI code on multiprogrammed shared memory machines. Conventional MPI implementations map each MPI node to an OS process, which causes severe performance degradation in multiprogrammed environments. Our previous work (TMPI) developed compile/run-time techniques to support threaded MPI execution by mapping each MPI node to a kernel thread. However, kernel threads have a higher context switch cost than user-level threads, which lengthens the spinning time required during MPI synchronization. This paper presents an adaptive two-level thread scheme for MPI that reduces context switch and synchronization costs. The scheme also exposes thread scheduling information at user level, which allows us to design an adaptive event waiting strategy that minimizes CPU spinning and exploits cache affinity. Our experiments show that an MPI system based on the proposed techniques substantially outperforms both the previous version of TMPI and the SGI MPI implementation in multiprogrammed environments. The improvement ratio can reach 161% or more, depending on the degree of multiprogramming.
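To make the spinning trade-off concrete, the following is a minimal sketch of a generic spin-then-block wait, written with POSIX threads and C11 atomics. It is not the scheme proposed in the paper: the names event_t, event_init, event_signal, event_wait, and the spin_limit parameter are hypothetical, and the fixed spin bound stands in for whatever adaptive policy a real system would derive from scheduling information.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical event object: a flag plus a mutex/condition pair for blocking. */
typedef struct {
    atomic_bool ready;
    pthread_mutex_t lock;
    pthread_cond_t cond;
} event_t;

void event_init(event_t *e) {
    atomic_init(&e->ready, false);
    pthread_mutex_init(&e->lock, NULL);
    pthread_cond_init(&e->cond, NULL);
}

/* Mark the event as ready and wake any waiter that has already blocked. */
void event_signal(event_t *e) {
    pthread_mutex_lock(&e->lock);
    atomic_store(&e->ready, true);
    pthread_cond_broadcast(&e->cond);
    pthread_mutex_unlock(&e->lock);
}

/* Poll the flag up to spin_limit times, then fall back to blocking.
 * Returns true if the event was observed during the spin phase,
 * false if the caller went through the blocking path.
 * Assumption: spin_limit is a fixed bound chosen by the caller. */
bool event_wait(event_t *e, int spin_limit) {
    for (int i = 0; i < spin_limit; i++) {
        if (atomic_load(&e->ready)) {
            return true;   /* cheap path: no context switch needed */
        }
    }
    pthread_mutex_lock(&e->lock);
    while (!atomic_load(&e->ready)) {
        pthread_cond_wait(&e->cond, &e->lock);  /* give up the CPU */
    }
    pthread_mutex_unlock(&e->lock);
    return false;
}
```

A larger spin_limit favors the case where the sender is running and the event arrives soon; a smaller one avoids wasting CPU when the sender has been descheduled. Because blocking costs a context switch, a cheaper switch lets the waiter give up spinning sooner, which is the relationship the abstract draws between context switch cost and spinning time.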
