Optimizing partitioned global address space programs for cluster architectures

Unified Parallel C (UPC) is an example of a partitioned global address space language for high-performance parallel computing. This programming model enables applications to be written in a shared-memory style while still allowing the programmer to control data layout and the assignment of work to processors. An open question is whether programs written in a simple style, with fine-grained accesses and blocking communication, can achieve performance approaching that of hand-optimized code, especially in cluster environments with high network latencies. This dissertation proposes an optimization framework for UPC that automates the transformations currently performed manually by programmers. The goals of the optimizations are twofold: we seek not only to aggregate fine-grained remote accesses, reducing both the number of messages and the volume of traffic, but also to overlap communication with computation to hide network latency. The framework first applies communication vectorization and strip-mining to optimize regular array accesses in loops. For irregular fine-grained accesses, we apply a partial redundancy elimination framework that also generates split-phase communication. The last phase targets the blocking bulk transfers in the program, using runtime support to schedule them automatically and achieve overlap; message aggregation is performed as part of the scheduling to further reduce communication overhead. Finally, we present several techniques for optimizing the serial performance of a UPC program, reducing the overhead of both UPC-specific constructs and our source-to-source translation. The optimization framework has been implemented in the Berkeley UPC compiler and tested on a number of supercomputer clusters. The optimizations are validated on a variety of benchmarks exhibiting different communication patterns, from bulk-synchronous benchmarks to dynamic shared data structure code. Experimental results reveal that our framework offers performance comparable to aggressive manual optimization and achieves significant speedups over the fine-grained, blocking communication code that programmers find much easier to write. Our framework is completely transparent to the user, and therefore improves productivity by freeing programmers from the details of communication management.
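To make the transformations concrete, the sketch below shows, in UPC itself, the manual form of the optimizations the framework automates: a naive fine-grained loop, its communication-vectorized and split-phase equivalent, and a strip-mined pipeline that overlaps each strip's transfer with computation on the previous one. This is a minimal illustrative sketch, not code from the dissertation: the array, its layout, and the sizes are hypothetical, and the non-blocking calls are the Berkeley UPC extensions (bupc_memget_async/bupc_waitsync, assumed to be available via bupc_extensions.h) rather than standard UPC.

    /* Illustrative sketch only; array layout and sizes are assumptions. */
    #include <upc.h>
    #include <bupc_extensions.h>   /* assumed header for the Berkeley UPC async extensions */

    #define N 1024
    shared [N] double a[N];        /* block size N: the whole array has affinity to thread 0 */
    double local[N];               /* private per-thread result buffer */

    void fine_grained(void)
    {
        /* Naive style: on threads other than 0, every iteration issues a
         * small, blocking remote read of a single element. */
        for (int i = 0; i < N; i++)
            local[i] += a[i];
    }

    void vectorized_split_phase(void)
    {
        /* Vectorized: one bulk transfer replaces N fine-grained reads,
         * issued split-phase so independent work can overlap it. */
        double buf[N];
        bupc_handle_t h = bupc_memget_async(buf, a, N * sizeof(double));
        /* ... independent computation overlaps the transfer here ... */
        bupc_waitsync(h);          /* complete the communication */
        for (int i = 0; i < N; i++)
            local[i] += buf[i];
    }

    void strip_mined(void)
    {
        /* Strip-mining: split the bulk transfer into strips and pipeline,
         * overlapping the fetch of strip s+1 with computation on strip s. */
        enum { STRIP = 256 };
        double buf[2][STRIP];      /* double buffering across strips */
        bupc_handle_t h = bupc_memget_async(buf[0], a, STRIP * sizeof(double));
        for (int s = 0; s < N / STRIP; s++) {
            bupc_waitsync(h);                      /* strip s has arrived */
            if (s + 1 < N / STRIP)                 /* prefetch the next strip */
                h = bupc_memget_async(buf[(s + 1) & 1],
                                      &a[(s + 1) * STRIP],
                                      STRIP * sizeof(double));
            for (int i = 0; i < STRIP; i++)        /* compute on strip s */
                local[s * STRIP + i] += buf[s & 1][i];
        }
    }

In the blocking version every iteration can cross the network; the split-phase versions issue one (or a few) bulk transfers and keep the processor busy while they complete. This is precisely the effect the framework's vectorization, strip-mining, and scheduling phases obtain automatically, without the programmer writing any of the above by hand.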
