A Design of Performance-optimized Control-based Synchronization

A fundamental issue that any control-based synchronization scheme must address is how to minimize both the overhead of the synchronization itself and the processor idling caused by variation in the arrival times of the synchronizing processors. This paper proposes two techniques to alleviate these two problems in a large-scale shared-memory multiprocessor. First, the notion of delayed global-materialization is introduced, which aims to minimize the time that synchronizing processors spend globally materializing previously issued shared write references, a step that is required before the processors participate in the actual synchronization. The scheme relies on a compile-time analysis of parallel programs to identify the write references to shared memory locations that will be accessed in the subsequent computational unit. Global-materialization of these write references is performed immediately, while that of the other shared write references is performed as lazily as possible. Second, a novel prefetching technique is proposed that allows prefetching across computational units separated by a synchronization operation, so as to keep otherwise idle processors busy during synchronization. This scheme also requires a compile-time analysis to determine whether the prefetch request for a given shared read reference can safely be issued across the synchronization. The hardware support required by the two schemes is identified, and the issues that arise when the two techniques are used together are addressed.
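The two compile-time classifications described above can be illustrated with a minimal sketch. It is not the analysis used in the paper: the function names, the representation of references as location names, and in particular the safety criterion for prefetching (a read is treated as safe only if no processor writes that location in the unit preceding the synchronization) are assumptions made purely for illustration.

```python
def classify_writes(writes_in_unit, accesses_in_next_unit):
    """Split shared writes into (immediate, lazy) global-materialization lists.

    Hypothetical helper: a write is materialized immediately if its location
    is accessed in the subsequent computational unit, and lazily otherwise.
    """
    next_accessed = set(accesses_in_next_unit)
    immediate = [w for w in writes_in_unit if w in next_accessed]
    lazy = [w for w in writes_in_unit if w not in next_accessed]
    return immediate, lazy


def safe_prefetches(reads_in_next_unit, writes_before_sync):
    """Return the reads that may be prefetched across the synchronization.

    Assumed criterion (not taken from the paper): a read is safe if its
    location is not written by any processor before the synchronization.
    """
    written = set(writes_before_sync)
    return [r for r in reads_in_next_unit if r not in written]


# Example with made-up shared locations "a" to "e".
immediate, lazy = classify_writes(["a", "b", "c"], {"b", "d"})
prefetchable = safe_prefetches(["b", "d", "e"], {"a", "b", "c"})
```

In this example, only the write to `b` needs immediate global-materialization, while the reads of `d` and `e` are candidates for prefetching across the synchronization under the assumed criterion.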