Achieving Low Cost Synchronization in a Multiprocessor System

A barrier is a commonly used mechanism for synchronizing processors executing in parallel. A software implementation of the barrier mechanism using shared variables has two major drawbacks. First, the synchronization overhead is high. Second, a processor that reaches the barrier must idle until all other processors have reached it. In this paper, the fuzzy barrier, a mechanism that avoids both drawbacks, is presented. The first problem is avoided by implementing the mechanism in hardware. The second problem is solved by using software techniques to find useful instructions that a processor can execute while it awaits synchronization. The hardware implementation eliminates busy waiting at barriers, provides a mask that allows disjoint subsets of processors to synchronize simultaneously, and provides multiple barriers by associating a tag with each barrier. Compiler techniques are presented for constructing barrier regions, which consist of instructions that a processor can execute while it is waiting for other processors to reach the barrier. The larger the barrier region, the more likely it is that none of the processors will have to stall. Initial observations show that barrier regions can be large and that program transformations can significantly increase their size.
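The idea of a barrier region can be illustrated with a minimal software sketch. This is only an analogue under stated assumptions, not the hardware mechanism presented in the paper: the class `SplitPhaseBarrier`, its methods `arrive` and `wait`, and the helper `run` are hypothetical names introduced here. A thread signals that it has reached the barrier with `arrive`, executes independent work (its barrier region), and blocks in `wait` only if some other thread has not yet arrived.

```python
import threading


class SplitPhaseBarrier:
    """Hypothetical software analogue of a fuzzy barrier (not the paper's hardware)."""

    def __init__(self, parties):
        self._parties = parties
        self._count = 0          # arrivals seen in the current generation
        self._generation = 0     # incremented each time all parties have arrived
        self._cond = threading.Condition()

    def arrive(self):
        """Mark entry into the barrier region; return a token for wait()."""
        with self._cond:
            token = self._generation
            self._count += 1
            if self._count == self._parties:
                # Last arrival: open the barrier for everyone.
                self._count = 0
                self._generation += 1
                self._cond.notify_all()
            return token

    def wait(self, token):
        """Mark exit from the barrier region; block only if someone is missing."""
        with self._cond:
            while self._generation == token:
                self._cond.wait()


def run(parties=4):
    """Each worker publishes a value, does independent work, then reads all values."""
    barrier = SplitPhaseBarrier(parties)
    published = [0] * parties
    totals = [None] * parties

    def worker(i):
        published[i] = i + 1           # must complete before the barrier
        token = barrier.arrive()       # start of the barrier region
        _ = sum(range(1000))           # independent work done instead of idling
        barrier.wait(token)            # end of the barrier region
        totals[i] = sum(published)     # safe: every worker has published

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals
```

In this sketch, the more work a thread places between `arrive` and `wait`, the more likely it is that all other threads have already arrived by the time it calls `wait`, so that no thread blocks.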