Performance Portable Optimizations for Loops Containing Communication Operations

As high-end computing systems continue to scale in per-node computational power and overall node count, optimization techniques that reduce communication overhead have become increasingly important. We present a loop optimization framework designed to achieve both efficient communication/computation overlap and performance portability. The framework is implemented in the Berkeley UPC compiler and combines compile-time analysis with runtime mechanisms. We extend the compiler to perform message vectorization and message strip-mining optimizations. At compile time, loop nests are analyzed, their communication requirements are determined, and the computation overhead is estimated. The compiler passes this analysis information to the runtime, and performance portability is achieved by decoupling data movement from local computation: we generate template code that uses the transferred data without making any assumptions about the underlying communication mechanism.
