Paper Information - An Advanced Compiler Framework for Non-Cache-Coherent Multiprocessors

The Cray T3D and T3E are non-cache-coherent (NCC) computers with a NUMA structure. They have been shown to exhibit very stable and scalable performance for a variety of application programs, and considerable evidence suggests that they are more stable and scalable than many other shared-memory multiprocessors. Their principal drawback, however, is a lack of programmability, caused by the absence of the global cache coherence needed to provide a convenient shared view of memory in hardware. This forces the programmer to keep careful track of where each piece of data is stored, a complication that is unnecessary when a pure shared-memory view is presented to the user. We believe that advanced compiler technology is a remedy for this problem. In this paper, we present our experience with a compiler framework for automatic parallelization and communication generation that has the potential to reduce the time-consuming hand-tuning that would otherwise be necessary to achieve good performance on this type of machine. Our experiments showed that the compiler performs well for a variety of applications on the T3D and T3E, and they identified a few sophisticated techniques that could improve performance further once they are fully implemented in the compiler.
