Synchronization Coherence: A Transparent Hardware Mechanism for Cache Coherence and Fine-Grained Synchronization

The quest to improve performance forces designers to explore finer-grained multiprocessor machines. Ever-increasing chip densities based on CMOS improvements fuel research in highly parallel chip multiprocessors with 100s of processing elements. With such increasing levels of parallelism, synchronization is set to become a major performance bottleneck and efficient support for synchronization an important design criterion. Previous research has shown that integrating support for fine-grained synchronization can have significant performance benefits compared to traditional coarse-grained synchronization. Not much progress has been made in supporting fine-grained synchronization transparently to processor nodes: a key reason perhaps why wide adoption has not followed. In this paper, we propose a novel approach called Synchronization Coherence that can provide transparent fine-grained synchronization and caching in a multiprocessor machine and single-chip multiprocessor. Our approach merges fine-grained synchronization mechanisms with traditional cache coherence protocols. It reduces network utilization as well as synchronization-related processing overheads while adding minimal hardware complexity as compared to cache coherence mechanisms or previously reported fine-grained synchronization techniques. In addition to its benefit of making synchronization transparent to processor nodes, for the applications studied, it provides up to 23% improvement in performance and up to 24% improvement in energy efficiency with no L2 caches compared to previous fine-grained synchronization techniques. The performance improvement increases up to 38% when simulating with an ideal L2 cache system.
