Hardware for speculative run-time parallelization in distributed shared-memory multiprocessors

Run-time parallelization is often the only way to execute code in parallel when data dependence information is incomplete at compile time, a situation that is common in many important applications. Unfortunately, known run-time parallelization techniques are often computationally expensive or not general enough. To address this problem, we propose new hardware support for efficient run-time parallelization in distributed shared-memory (DSM) multiprocessors. The idea is to execute the code in parallel speculatively and to use extensions to the cache coherence protocol hardware to detect any dependence violations. As soon as a violation is detected, execution stops, the state is restored, and the code is re-executed serially. The scheme, which we apply to loops, allows iterations to execute and complete in potentially any order. It requires hardware extensions to the cache coherence protocol and memory hierarchy of a DSM machine, and it has low overhead. We present the algorithms and a hardware design of the scheme. Overall, the scheme delivers average loop speedups of 7.3 on 16 processors and is 50% faster than a related software-only method.
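
The control flow described in the abstract (speculate, detect a violation, restore state, re-execute serially) can be illustrated with a small software emulation. This is only a sketch under stated assumptions, not the hardware design the paper proposes: the loop body `data[writes[i]] = data[reads[i]] + 1`, the function `run_loop_speculatively`, and the per-element `writer`/`readers` bookkeeping are all hypothetical names chosen for illustration. Parallel execution is emulated by running iterations in a shuffled order, and the check is deliberately conservative: it flags any element touched by two different iterations when at least one of them writes it.

```python
import random


class DependenceViolation(Exception):
    """Raised when two different iterations conflict on the same element."""


def run_loop_speculatively(data, reads, writes, seed=0):
    """Emulate the loop: data[writes[i]] = data[reads[i]] + 1.

    Hypothetical illustration only. Returns "parallel" if speculation
    succeeded, or "serial" if a violation forced serial re-execution.
    """
    n = len(reads)
    checkpoint = list(data)   # saved state, restored if speculation fails
    writer = {}               # element -> iteration that wrote it
    readers = {}              # element -> set of iterations that read it

    def on_read(elem, it):
        # Conflict: another iteration has already written this element.
        if writer.get(elem, it) != it:
            raise DependenceViolation
        readers.setdefault(elem, set()).add(it)

    def on_write(elem, it):
        # Conflict: another iteration has already written or read it.
        if writer.get(elem, it) != it or readers.get(elem, set()) - {it}:
            raise DependenceViolation
        writer[elem] = it

    def body(it, tracked):
        src, dst = reads[it], writes[it]
        if tracked:
            on_read(src, it)
        value = data[src] + 1
        if tracked:
            on_write(dst, it)
        data[dst] = value

    # Assumption: a shuffled order stands in for unordered parallel execution.
    order = list(range(n))
    random.Random(seed).shuffle(order)
    try:
        for it in order:
            body(it, tracked=True)
        return "parallel"
    except DependenceViolation:
        data[:] = checkpoint          # restore the pre-loop state
        for it in range(n):           # re-execute in original serial order
            body(it, tracked=False)
        return "serial"
```

In this emulation, a loop whose iterations touch disjoint elements completes speculatively, while a loop in which one iteration reads what another writes falls back to serial execution and still produces the serial result. The paper's scheme performs the detection step in the cache coherence hardware rather than with software bookkeeping like the dictionaries used here.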
