SCIMA-SMP: on-chip memory processor architecture for SMP

In this paper, we propose SCIMA-SMP (Software Controlled Integrated Memory Architecture for SMP), a processor architecture with programmable on-chip memory for high-performance SMP (symmetric multiprocessor) nodes, intended to address the performance gap between the processor and off-chip memory. Special instructions enable explicit data transfer between on-chip and off-chip memory, so the application program controls both the timing and the granularity of each transfer, and the SMP bus is used more efficiently than in a traditional cache-only architecture. A performance evaluation based on clock-level simulation of various HPC applications confirmed that this architecture greatly reduces bus access cycles by avoiding redundant data transfers and by controlling the granularity of data movement between on-chip and off-chip memory.
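
To illustrate why transfer granularity matters for bus traffic, the following is a minimal sketch of a toy traffic model, not the simulator or the instructions used in this work. The line size, element size, and the function names `cache_bus_bytes` and `explicit_bus_bytes` are all assumptions made for illustration. The model compares the bytes moved when a cache fetches a whole line for each strided element against an explicit transfer that moves only the elements requested.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model parameters, chosen for illustration only. */
enum { LINE_BYTES = 64, ELEM_BYTES = 8 };

/* Bytes moved over the bus when a cache-only design reads n_elems
 * elements with the given stride (in elements). Assumes a cold cache,
 * a line-aligned starting address, and whole-line fills. */
size_t cache_bus_bytes(size_t n_elems, size_t stride)
{
    assert(stride > 0);
    if (n_elems == 0)
        return 0;

    size_t stride_bytes = stride * ELEM_BYTES;
    if (stride_bytes >= LINE_BYTES)
        return n_elems * LINE_BYTES;   /* every element touches a new line */

    /* Dense or short-stride access: count the lines covered by the span. */
    size_t span = (n_elems - 1) * stride_bytes + ELEM_BYTES;
    size_t lines = (span + LINE_BYTES - 1) / LINE_BYTES;
    return lines * LINE_BYTES;
}

/* Bytes moved when software issues an explicit strided transfer that
 * carries only the requested elements into on-chip memory. */
size_t explicit_bus_bytes(size_t n_elems)
{
    return n_elems * ELEM_BYTES;
}
```

Under these assumed parameters, reading 1024 elements with a stride of 16 moves 65536 bytes through the cache model but only 8192 bytes with the explicit transfer. For unit-stride access the two models move the same amount of data, so any saving in this toy model comes entirely from strided or sparse access patterns.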
