Cyclical cascade chains: a dynamic barrier synchronization mechanism for multiprocessor systems

To achieve peak performance in a multiprocessor system, processor synchronization must be accomplished with minimal overhead. Static hardware barrier synchronization has become a popular mechanism for coordinating parallel processors [1]. Dynamic hardware barrier synchronization has typically been considered to require far more overhead than static barrier hardware, so its potential performance benefit has been largely ignored [2, 3]. This paper describes a minimal-overhead hardware implementation of a dynamic barrier synchronization mechanism based on “cyclical cascade chains.” The design can be synthesized into a single FPGA, providing an inexpensive and efficient synchronization solution for clusters and multiprocessor computers. It is proven to be deadlock-free and synchronizes 32 processors in an average of 196 nanoseconds. Scalability is addressed for every aspect of the design relevant to its synthesis.