Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures

Most HPC systems are clusters of shared-memory nodes. Parallel programming on such systems must combine distributed-memory parallelization across the node interconnect with shared-memory parallelization inside each node. This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes: the hybrid MPI+OpenMP programming model is compared with pure MPI, compiler-based parallelization, and other parallel programming models on hybrid architectures. The focus is on bandwidth and latency aspects, and on whether the programming paradigms can separate the optimization of communication from the optimization of computation. Benchmark results are presented for hybrid and pure MPI communication.
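To make the hybrid model concrete, the following is a minimal sketch of an MPI+OpenMP program in the common style where MPI handles communication between SMP nodes and OpenMP parallelizes the computation inside each node, with communication funneled through the master thread (MPI_THREAD_FUNNELED). The array size, the ring-shift exchange, and the variable names are illustrative assumptions, not taken from the paper.

/* Hybrid MPI+OpenMP sketch: one MPI process per SMP node, OpenMP threads
 * inside the node, MPI calls issued only outside OpenMP parallel regions. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000   /* illustrative buffer size */

int main(int argc, char **argv)
{
    int provided, rank, size;
    static double local[N], halo[N];

    /* Request funneled threading: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Computation phase: shared-memory parallelism inside the node. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        local[i] = (double)(rank + i);

    /* Communication phase: distributed-memory exchange over the node
     * interconnect (here a simple ring shift), done by the master thread. */
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    MPI_Sendrecv(local, N, MPI_DOUBLE, right, 0,
                 halo,  N, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("halo[0] = %f\n", halo[0]);

    MPI_Finalize();
    return 0;
}

In this pattern the computation and communication phases are strictly separated, which is precisely the property examined in the paper: whether the paradigm allows each phase to be optimized independently.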
