The fuzzy barrier: a mechanism for high speed synchronization of processors

Parallel programs are commonly written using barriers to synchronize parallel processes. Upon reaching a barrier, a processor must stall until all participating processors have reached the barrier. A software implementation of the barrier mechanism using shared variables has two major drawbacks. First, executing the barrier may be slow, since it may require several instructions and may also cause hot-spot accesses. Second, processors that are stalled waiting for other processors to reach the barrier are essentially idle and cannot do any useful work. This paper presents the notion of the fuzzy barrier, which avoids both drawbacks. The first problem is avoided by implementing the mechanism in hardware. The second problem is solved by extending the barrier concept to include a region of statements that a processor can execute while it awaits synchronization. These barrier regions are constructed by a compiler and consist of several instructions, such that a processor is ready to synchronize upon reaching the first instruction in the region and must synchronize before exiting the region. When synchronization does occur, the processors may be executing at any point in their respective barrier regions. The larger the barrier region, the more likely it is that none of the processors will have to stall. Preliminary investigations show that barrier regions can be large and that program transformations can significantly increase their size. Examples of situations in which such a mechanism can improve performance are presented. Results based on a software implementation of the fuzzy barrier on the Encore multiprocessor indicate that the mechanism can greatly reduce synchronization overhead.
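The barrier-region idea can be illustrated in software as a split-phase barrier. The following is a minimal sketch only, assuming Python threads stand in for processors; it is neither the hardware mechanism proposed in the paper nor its Encore implementation. The names `FuzzyBarrier`, `arrive`, `wait`, and `run_demo` are hypothetical and chosen for illustration. `arrive` corresponds to reaching the first instruction of the barrier region, and `wait` corresponds to leaving it.

```python
import threading


class FuzzyBarrier:
    """Split-phase barrier (illustrative sketch, not the paper's hardware design).

    arrive() marks entry to the barrier region and never blocks.
    wait() marks exit from the region and blocks only if some
    participant has not yet arrived for the same phase.
    """

    def __init__(self, parties):
        self._parties = parties
        self._count = 0          # arrivals seen in the current phase
        self._generation = 0     # number of completed phases
        self._cond = threading.Condition()

    def arrive(self):
        """Signal readiness to synchronize; returns a ticket for this phase."""
        with self._cond:
            ticket = self._generation
            self._count += 1
            if self._count == self._parties:
                # Last arrival completes the phase and releases any waiters.
                self._count = 0
                self._generation += 1
                self._cond.notify_all()
            return ticket

    def wait(self, ticket):
        """Leave the barrier region; returns True if the caller had to stall."""
        with self._cond:
            stalled = self._generation == ticket
            while self._generation == ticket:
                self._cond.wait()
            return stalled


def run_demo(parties=4, phases=3):
    """Run workers through several phases; return a list of ordering violations."""
    barrier = FuzzyBarrier(parties)
    shared = [[None] * parties for _ in range(phases)]
    errors = []

    def worker(pid):
        for phase in range(phases):
            shared[phase][pid] = pid * 10 + phase  # result other workers depend on
            ticket = barrier.arrive()              # enter the barrier region
            _ = sum(range(1000))                   # independent work inside the region
            barrier.wait(ticket)                   # must synchronize before exiting
            # After the region, every worker's result for this phase must be visible.
            if any(value is None for value in shared[phase]):
                errors.append((pid, phase))

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


if __name__ == "__main__":
    print(run_demo())
```

The longer the independent work between `arrive` and `wait`, the more likely it is that every other worker has already arrived and `wait` returns without stalling, which mirrors the abstract's observation about large barrier regions.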
