Automatic MPI application transformation with ASPhALT

This paper describes a source-to-source compilation tool for optimizing MPI-based parallel applications. The tool automatically applies a "prepushing" transformation that causes MPI programs to send data aggressively as soon as it is available, improving communication-computation overlap and, in turn, application performance. We present asphalt_transformer, the Open64-based component of our framework, ASPhALT, that performs the prepushing transformation automatically. We also present an extensive study of the performance gains achieved by automatically transformed codes. In particular, we demonstrate how different levels of aggregation affect the performance of parallel programs executing various computation kernels on different clusters. Furthermore, we discuss the differences in performance improvement between hand-optimized and automatically optimized codes, as well as the effect of automation on time-to-solution.
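For readers unfamiliar with the transformation, the sketch below illustrates the prepushing pattern in plain C with MPI: rather than computing an entire buffer and then sending it with one call after the loop, the sender issues a non-blocking MPI_Isend for each block as soon as that block has been computed, so communication overlaps with the remaining computation. This is a minimal illustrative sketch, not ASPhALT's actual output; the buffer size, block size (standing in for the aggregation level), and compute_block routine are assumed placeholders.

```c
/* Minimal sketch of the "prepushing" idea: each block is sent with a
 * non-blocking MPI_Isend as soon as it has been computed, overlapping
 * communication with the computation of later blocks.
 * Run with at least two ranks, e.g.  mpirun -np 2 ./prepush  */
#include <mpi.h>
#include <stdlib.h>

#define N     (1 << 20)      /* total number of doubles (assumed) */
#define BLOCK (1 << 16)      /* aggregation level (assumed)       */

static void compute_block(double *buf, int start, int len) {
    for (int i = start; i < start + len; i++)
        buf[i] = (double)i * 0.5;   /* placeholder computation */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(N * sizeof(double));
    int nblocks = N / BLOCK;
    MPI_Request *reqs = malloc(nblocks * sizeof(MPI_Request));

    if (rank == 0) {
        /* Prepushed sender: push each block as soon as it is ready. */
        for (int b = 0; b < nblocks; b++) {
            compute_block(buf, b * BLOCK, BLOCK);
            MPI_Isend(buf + b * BLOCK, BLOCK, MPI_DOUBLE, 1, b,
                      MPI_COMM_WORLD, &reqs[b]);
        }
        MPI_Waitall(nblocks, reqs, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        /* Receiver posts matching non-blocking receives per block. */
        for (int b = 0; b < nblocks; b++)
            MPI_Irecv(buf + b * BLOCK, BLOCK, MPI_DOUBLE, 0, b,
                      MPI_COMM_WORLD, &reqs[b]);
        MPI_Waitall(nblocks, reqs, MPI_STATUSES_IGNORE);
    }

    free(reqs);
    free(buf);
    MPI_Finalize();
    return 0;
}
```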
