Remote Store Programming

This paper presents remote store programming (RSP), a programming paradigm which combines usability and efficiency through the exploitation of a simple hardware mechanism, the remote store, which can easily be added to existing multicores. The RSP model and its hardware implementation trade a relatively high store latency for a low load latency because loads are more common than stores, and it is easier to tolerate store latency than load latency. This paper demonstrates the performance advantages of remote store programming by comparing it to cache-coherent shared memory (CCSM) for several important embedded benchmarks using the TILEPro64 processor. RSP is shown to be faster than CCSM for all eight benchmarks using 64 cores. For five of the eight benchmarks, RSP is shown to be more than 1.5 × faster than CCSM. For a 2D FFT implemented on 64 cores, RSP is over 3 × faster than CCSM. RSP's features, performance, and hardware simplicity make it well suited to the embedded processing domain.

[1]  Liviu Iftode,et al.  Software support for virtual memory-mapped communication , 1996, Proceedings of International Conference on Parallel Processing.

[2]  Henry Hoffmann,et al.  On-Chip Interconnection Architecture of the Tile Processor , 2007, IEEE Micro.

[3]  David A. Patterson,et al.  Computer architecture (2nd ed.): a quantitative approach , 1996 .

[4]  Robert W. Numrich,et al.  Co-array Fortran for parallel programming , 1998, FORF.

[5]  Henry Hoffmann,et al.  Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[6]  Milan M. Jovanovic,et al.  An overview of reflective memory systems , 1999, IEEE Concurr..

[7]  Katherine Yelick,et al.  Titanium: a high-performance Java dialect , 1998 .

[8]  Richard B. Gillett Memory Channel Network for PCI , 1996, IEEE Micro.

[9]  Christoforos E. Kozyrakis,et al.  Comparing memory systems for chip multiprocessors , 2007, ISCA '07.

[10]  Jaswinder Pal Singh,et al.  A Comparison of MPI, SHMEM and Cache-Coherent Shared Address Space Programming Models on a Tightly-Coupled Multiprocessors , 2001, International Journal of Parallel Programming.

[11]  Fabrizio Petrini,et al.  Cell Multiprocessor Communication Network: Built for Speed , 2006, IEEE Micro.

[12]  Kai Li,et al.  Protected, user-level DMA for the SHRIMP network interface , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[13]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .