An Advanced Compiler Framework for Non-Cache-Coherent Multiprocessors

The Cray T3D and T3E are non-cache-coherent (NCC) computers with a NUMA structure. They have been shown to exhibit very stable and scalable performance for a variety of application programs, and considerable evidence suggests that they are more stable and scalable than many other shared-memory multiprocessors. Their principal drawback, however, is poor programmability, caused by the absence of the global cache coherence needed to provide a convenient shared view of memory in hardware. This forces the programmer to keep careful track of where each piece of data is stored, a complication that is unnecessary when a pure shared-memory view is presented to the user. We believe that advanced compiler technology can remedy this problem. In this paper, we present our experience with a compiler framework for automatic parallelization and communication generation that has the potential to reduce the time-consuming hand-tuning that would otherwise be necessary to achieve good performance on this type of machine. Our experiments show that the compiler performs well for a variety of applications on the T3D and T3E, and they point to a few sophisticated techniques that could improve performance further once they are fully implemented in the compiler.
