Paper information - Compiling and optimizing for decoupled architectures

Compiling and optimizing for decoupled architectures

Decoupled architectures provide a key to the problem of sustained supercomputer performance through their ability to hide large memory latencies. When a program executes in a decoupled mode, the perceived memory latency at the processor is zero; effectively the entire physical memory has an access time equivalent to the processor's register file, and latency is completely hidden. However, the asynchronous functional units within a decoupled architecture must occasionally synchronize, incurring a high penalty. The goal of compiling and optimizing for decoupled architectures is to partition the program between the asynchronous functional units in such a way that latencies are hidden but synchronization events are executed infrequently. This paper describes a model for decoupled compilation, and explains the effectiveness of compilation for decoupled systems. A number of new compiler optimizations are introduced and evaluated quantitatively using the Perfect Club scientific benchmarks. We show that with a suitable repertoire of optimizations, it is possible to hide large latencies most of the time for most of the programs in the Perfect Club.

Alasdair Rawsthorne | Nigel Topham | Muriel Mewissen | Callum McLean | Peter Bird
