Active memory techniques for ccNUMA multiprocessors

Our recent work on uniprocessor and single-node multiprocessor (SMP) active memory systems uses address remapping techniques in conjunction with extended cache coherence protocols to improve access locality in processor caches. In this paper we extend that work and introduce the concept of multi-node active memory systems. We present the design of multi-node active memory cache coherence protocols that help reduce remote memory latency and improve the scalability of matrix transpose and parallel reduction on distributed shared memory (DSM) multiprocessors. We evaluate our design on seven applications through execution-driven simulation of small and medium-scale multiprocessors. On a 32-processor system, active-memory-optimized matrix transpose attains speedups of 1.53 to 2.01, and parallel reduction achieves speedups of 1.19 to 2.81, over normal parallel execution.
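To make the idea of address remapping for matrix transpose concrete, the following is a minimal software sketch, not the hardware mechanism evaluated in the paper. It assumes a row-major matrix and exposes a transposed view whose accesses are redirected to the original storage, so no copy is made. The class name `RemappedTranspose` and the helper `demo` are hypothetical names chosen for illustration; in an active memory system the remapping would be done by the memory controller and kept coherent by the extended protocol.

```python
class RemappedTranspose:
    """Hypothetical software analog of a remapped (shadow) address space."""

    def __init__(self, data, rows, cols):
        assert len(data) == rows * cols
        self.data = data      # row-major storage of the original matrix A
        self.rows = rows      # number of rows of A
        self.cols = cols      # number of columns of A

    def _remap(self, i, j):
        # Element (i, j) of the transposed view is element (j, i) of A,
        # which lives at offset j * cols + i in the row-major storage.
        return j * self.cols + i

    def __getitem__(self, idx):
        i, j = idx
        return self.data[self._remap(i, j)]

    def __setitem__(self, idx, value):
        i, j = idx
        self.data[self._remap(i, j)] = value


def demo():
    a = [1, 2, 3,
         4, 5, 6]                      # 2 x 3 matrix A, row-major
    at = RemappedTranspose(a, rows=2, cols=3)
    # Row 0 of the transposed view is column 0 of A.
    row0 = [at[0, j] for j in range(2)]
    # A write through the view lands in the original storage, at A[1][2].
    at[2, 1] = 60
    return row0, a
```

Walking a row of the view corresponds to walking a column of the original matrix, which is the access pattern that remapping aims to turn into contiguous cache-line accesses.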