SUDS: Primitive Mechanisms for Memory Dependence Speculation

As VLSI chip sizes and densities increase, it becomes possible to put many processing elements on a single chip and connect them with a low-latency communication network. In this paper we propose a software system, SUDS (Software Un-Do System), that leverages these resources, using speculation to exploit parallelism in integer programs with many data dependences. We demonstrate that, in order to achieve parallel speedups, a speculation system must deliver memory request latencies lower than about 30 cycles. We give a cost breakdown for our current working implementation of SUDS, whose memory request latency comes close to meeting this goal. In addition, we identify the three primitive runtime operations that are necessary to parallelize these programs efficiently, and the subsystems that provide them: (1) a fast communication path for true dependences within the program, (2) a method for renaming variables that have anti- and output dependences, and (3) a memory dependence speculation mechanism that guarantees that parallel accesses to global data structures do not violate sequential program semantics. We find that these three subsystems do not interact, so they can be implemented separately. Each subsystem is then simple enough that it can be built in software using only minimal hardware support. In this paper we focus on the memory dependence subsystem and demonstrate that it can be implemented using a simple but effective low-cost protocol.
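To make the third subsystem concrete, the following is a minimal sketch of memory dependence speculation with an undo log. It is not the SUDS protocol, which the abstract does not specify; it assumes a generic timestamp-ordering check in which each loop iteration's number serves as its timestamp. The names `SpeculativeMemory`, `DependenceViolation`, `load`, `store`, `rollback`, and `commit` are all hypothetical.

```python
class DependenceViolation(Exception):
    """Raised when an access would break sequential program order."""


_MISSING = object()  # marks an address that had no value before a store


class SpeculativeMemory:
    """Illustrative only: timestamp check plus undo log (assumed design)."""

    def __init__(self, initial=None):
        self.mem = dict(initial or {})
        self.last_reader = {}   # addr -> highest iteration that loaded it
        self.last_writer = {}   # addr -> highest iteration that stored it
        self.undo_log = []      # (addr, previous value), newest last

    def load(self, iteration, addr):
        # A later iteration already wrote this address, so an earlier
        # iteration reading now would observe a value from its future.
        if self.last_writer.get(addr, -1) > iteration:
            raise DependenceViolation(f"load of {addr!r} by {iteration}")
        self.last_reader[addr] = max(self.last_reader.get(addr, -1), iteration)
        return self.mem.get(addr, 0)

    def store(self, iteration, addr, value):
        # A later iteration already read or wrote this address, so this
        # store arrives too late to be ordered correctly.
        if (self.last_reader.get(addr, -1) > iteration
                or self.last_writer.get(addr, -1) > iteration):
            raise DependenceViolation(f"store to {addr!r} by {iteration}")
        self.undo_log.append((addr, self.mem.get(addr, _MISSING)))
        self.mem[addr] = value
        self.last_writer[addr] = iteration

    def rollback(self):
        # Undo speculative stores in reverse order, then forget timestamps.
        while self.undo_log:
            addr, old = self.undo_log.pop()
            if old is _MISSING:
                self.mem.pop(addr, None)
            else:
                self.mem[addr] = old
        self.last_reader.clear()
        self.last_writer.clear()

    def commit(self):
        # Speculation succeeded: keep memory, discard bookkeeping.
        self.undo_log.clear()
        self.last_reader.clear()
        self.last_writer.clear()


if __name__ == "__main__":
    m = SpeculativeMemory({"x": 1})
    m.load(1, "x")              # iteration 1 reads x speculatively
    try:
        m.store(0, "x", 7)      # iteration 0 writes x too late
    except DependenceViolation:
        m.rollback()            # restore the pre-speculation state
    print(m.mem)
```

In this sketch a violation rolls back every speculative store and leaves re-execution to the caller; a real system would choose its own recovery granularity.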
