Paper information - A Design of Performance-optimized Control-based Synchronization

A fundamental issue that any control-based synchronization scheme must address is how to minimize both the overhead of the synchronization itself and the processor idling caused by variation in the arrival times of the synchronizing processors. This paper proposes two techniques to alleviate these two problems in a large-scale shared-memory multiprocessor. First, the notion of delayed global-materialization is introduced, which aims to minimize the time the synchronizing processors spend globally materializing previously issued shared write references. This materialization is required before the processors participate in the actual synchronization step. The scheme relies on a compile-time analysis of parallel programs to identify the write references to shared memory locations that will be accessed in the subsequent computational unit. Global materialization for these write references is performed immediately, while that for all other shared write references is performed as lazily as possible. Second, a novel prefetching technique is proposed that allows prefetching across computational units separated by a synchronization operation, so that processors that would otherwise idle are kept busy during synchronization. This scheme also requires a compile-time analysis to determine whether the prefetch request for a given shared read reference can safely be issued across the synchronization. The hardware support required for the two schemes is identified, and the issues that arise when the two techniques are used together are addressed.
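The classification rule behind delayed global-materialization can be illustrated with a small sketch. This is not the paper's compiler analysis; it only mirrors the rule stated in the abstract, assuming the analysis has already produced the set of shared locations accessed in the next computational unit. The names `WriteRef`, `classify_writes`, and the symbolic location strings are hypothetical and introduced here purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteRef:
    """A shared write issued in the current computational unit (hypothetical model)."""
    name: str      # label of the write reference in the program text
    location: str  # symbolic shared-memory location it writes


def classify_writes(writes, next_unit_accesses):
    """Split writes into (immediate, lazy) global-materialization groups.

    A write is 'immediate' when its location is accessed in the subsequent
    computational unit; every other write is 'lazy'. The access set is
    assumed to come from a separate compile-time analysis.
    """
    accessed = set(next_unit_accesses)
    immediate = [w for w in writes if w.location in accessed]
    lazy = [w for w in writes if w.location not in accessed]
    return immediate, lazy


# Illustrative input: three writes before a synchronization point.
writes = [WriteRef("w1", "A[0]"), WriteRef("w2", "B[3]"), WriteRef("w3", "A[1]")]
immediate, lazy = classify_writes(writes, next_unit_accesses=["A[0]", "A[1]"])
```

In this example `w1` and `w3` would be materialized before the synchronization step because the next unit touches `A[0]` and `A[1]`, while the materialization of `w2` could be deferred.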
