On-chip communication and synchronization mechanisms with cache-integrated network interfaces

Per-core local (scratchpad) memories allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces (NIs), appropriate for scalable multicores, that combine the best of two worlds, the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized NI functions. This paper presents our architecture, which provides local and remote scratchpad access to either individual words or, through RDMA copy, multi-word blocks. Furthermore, we introduce event responses as a mechanism for software-configurable synchronization primitives. We present three event response mechanisms that expose NI functionality to software: multi-word transfer initiation, memory barriers for explicitly selected accesses of arbitrary size, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and evaluated on-chip communication performance on the prototype as well as on a CMP simulator with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks and efficient exploitation of the hardware bandwidth.
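To give a feel for the barrier over explicitly selected accesses of arbitrary size, the following is a minimal software sketch in Python, not the hardware design. It assumes a counter-style event response that is armed with an expected byte count and fires a notification once that many bytes have been acknowledged. The names `CompletionCounter`, `acknowledge`, `rdma_copy`, and the 4-byte packet size are all hypothetical and chosen only for illustration.

```python
class CompletionCounter:
    """Toy model of a counter-style event response (hypothetical names).

    The counter is armed with the total number of bytes that a group of
    selected transfers will deliver, and runs on_complete exactly once
    when all of those bytes have been acknowledged.
    """

    def __init__(self, expected_bytes, on_complete):
        if expected_bytes <= 0:
            raise ValueError("expected_bytes must be positive")
        self.remaining = expected_bytes
        self.on_complete = on_complete
        self.fired = False

    def acknowledge(self, nbytes):
        """Record the arrival of nbytes; fire when nothing remains."""
        if self.fired:
            raise RuntimeError("counter already fired")
        if nbytes <= 0 or nbytes > self.remaining:
            raise ValueError("acknowledged size out of range")
        self.remaining -= nbytes
        if self.remaining == 0:
            self.fired = True
            self.on_complete()


def rdma_copy(src, dst, dst_offset, counter, packet_bytes=4):
    """Copy src into the bytearray dst in fixed-size packets.

    Each packet is acknowledged to the counter as it lands, so the
    counter observes completion regardless of how the transfer is split.
    The packet size is an assumption for illustration only.
    """
    for start in range(0, len(src), packet_bytes):
        chunk = src[start:start + packet_bytes]
        begin = dst_offset + start
        dst[begin:begin + len(chunk)] = chunk
        counter.acknowledge(len(chunk))
```

In this sketch, two transfers of 6 and 4 bytes tracked by one counter armed with 10 bytes notify the consumer only after the second transfer completes, which is the behavior a barrier over a selected set of accesses is meant to provide.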
