Paper information - On-chip communication and synchronization mechanisms with cache-integrated network interfaces

Per-core local (scratchpad) memories allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces (NIs), appropriate for scalable multicores, that combine the best of both worlds: the flexibility of caches and the efficiency of scratchpad memories. On-chip SRAM is configurably shared among caching, scratchpad, and virtualized NI functions. This paper presents our architecture, which provides local and remote scratchpad access, either to individual words or, through RDMA copy, to multi-word blocks. Furthermore, we introduce event responses as a mechanism for software-configurable synchronization primitives. We present three event response mechanisms that expose NI functionality to software: multi-word transfer initiation, memory barriers for explicitly selected accesses of arbitrary size, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and evaluated on-chip communication performance on the prototype as well as on a CMP simulator with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks and efficient exploitation of the hardware bandwidth.
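
To make the idea of a memory barrier over explicitly selected accesses more concrete, the following is a minimal software sketch, not the paper's hardware interface. It assumes one plausible realization: a byte counter that is charged when a selected transfer is issued, decremented as packets are acknowledged, and that triggers a notification when it returns to zero. The names `CompletionCounter`, `rdma_copy`, `expect`, `arm`, and `acknowledge` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CompletionCounter:
    """Toy model (assumption): tracks outstanding bytes of selected transfers."""
    on_zero: Callable[[], None]   # notification run when all bytes have arrived
    outstanding: int = 0
    armed: bool = False

    def expect(self, nbytes: int) -> None:
        # Charge the counter when a selected transfer is issued.
        self.outstanding += nbytes

    def arm(self) -> None:
        # Software has finished issuing transfers; completion may now fire.
        self.armed = True
        self._maybe_fire()

    def acknowledge(self, nbytes: int) -> None:
        # A packet of nbytes has been written at its destination.
        self.outstanding -= nbytes
        self._maybe_fire()

    def _maybe_fire(self) -> None:
        if self.armed and self.outstanding == 0:
            self.armed = False
            self.on_zero()


def rdma_copy(src: bytes, dst: bytearray, dst_off: int,
              counter: CompletionCounter,
              packet: int = 64) -> List[Callable[[], None]]:
    """Issue a multi-word copy into a (simulated) remote scratchpad.

    Returns one pending delivery per packet, so the caller can deliver
    them in any order, as a network might.
    """
    data = bytes(src)
    counter.expect(len(data))
    pending: List[Callable[[], None]] = []
    for off in range(0, len(data), packet):
        chunk = data[off:off + packet]

        def deliver(off: int = off, chunk: bytes = chunk) -> None:
            dst[dst_off + off:dst_off + off + len(chunk)] = chunk
            counter.acknowledge(len(chunk))

        pending.append(deliver)
    return pending


if __name__ == "__main__":
    remote = bytearray(256)                       # simulated remote scratchpad
    counter = CompletionCounter(on_zero=lambda: print("selected transfers complete"))
    pending = rdma_copy(bytes(range(100)), remote, 16, counter)
    pending += rdma_copy(b"\xff" * 40, remote, 200, counter)
    counter.arm()
    for deliver in reversed(pending):             # out-of-order arrival
        deliver()
```

In this sketch the notification depends only on the transfers charged to the counter, regardless of their sizes or arrival order, which is the property the abstract attributes to barriers for explicitly selected accesses.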