Performance modeling of communication and computation in hybrid MPI and OpenMP applications

Performance evaluation and modeling are crucial to optimizing parallel programs. Programs written using two programming models, such as MPI and OpenMP, require analysis to determine both their performance efficiency and the most suitable numbers of processes and threads for execution on a given platform. To study both problems, we propose a model that is based on a small number of parameters yet captures the complexity of the runtime system. The model must incorporate measurements of the overheads introduced by each programming model, and thus must cover both the network and computational aspects of the system. We combine two techniques: static analysis, driven by the OpenUH compiler, to retrieve application signatures; and parallelization overhead measurement benchmarks, realized with Sphinx and PerfSuite, to collect system profiles. Finally, we propose a performance evaluation measure to identify communication and computation efficiency. In this paper we describe our underlying framework and the performance model, and show how our tool can be applied to a sample code.
