Performance modelling for task-parallel programs

Many applications from scientific computing and physical simulations can benefit from a mixed task and data parallel implementation on parallel machines with a distributed memory organization, although a pure data parallel implementation may sometimes lead to faster execution times. Since the effort of writing a mixed task and data parallel implementation is large, an a priori estimation of the possible benefits of such an implementation on a given parallel machine would be useful. In this article, we propose an estimation method for the execution time that is based on modelling computation and communication times by runtime formulas. The effect of concurrent message transmissions is captured by a contention factor for the specific target machine. To demonstrate the usefulness of the approach, we consider a complex method for the solution of ordinary differential equations that offers potential for a mixed task and data parallel execution. As distributed memory machines, we consider the Cray T3E and a Linux cluster.
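The idea of runtime formulas with a contention factor can be illustrated by a minimal sketch: computation time is modelled as operations times cost per operation, communication time as a startup/bandwidth formula scaled by a machine-specific contention factor when messages are transmitted concurrently. All parameter values below are hypothetical and purely illustrative; they are not the calibrated values for the Cray T3E or the Linux cluster considered in the article.

```python
def t_comp(n_ops, t_op):
    """Computation time: number of arithmetic operations times time per operation."""
    return n_ops * t_op

def t_comm(n_bytes, startup, bandwidth, contention=1.0):
    """Communication time: startup plus transfer time, scaled by a
    contention factor that captures concurrent message transmissions."""
    return contention * (startup + n_bytes / bandwidth)

def predict(n_ops, t_op, n_bytes, startup, bandwidth, contention):
    """Predicted execution time as the sum of computation and communication."""
    return t_comp(n_ops, t_op) + t_comm(n_bytes, startup, bandwidth, contention)

# Hypothetical machine and program parameters (illustrative only):
# a pure data parallel version communicates more data under higher contention
# than a mixed task and data parallel version of the same program.
pure_data = predict(1e9, 2e-9, 8e6, 20e-6, 300e6, contention=4.0)
mixed     = predict(1e9, 2e-9, 2e6, 20e-6, 300e6, contention=1.5)
print(mixed < pure_data)  # the mixed version is predicted faster here
```

With such formulas in hand, the two candidate implementations can be compared on paper before either is written, which is exactly the a priori estimation the article argues for.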
