Leveraging cache coherence in active memory systems

Active memory systems help processors overcome the memory wall when applications exhibit poor cache behavior. They consist of either active memory elements that perform data-parallel computations in the memory system itself, or an active memory controller that supports address re-mapping techniques that improve data locality. Both active memory approaches create coherence problems, even on uniprocessor systems, because either additional processors operate on the data directly or the processor is allowed to refer to the same data via more than one address. While most active memory implementations require cache flushes, we propose a new technique that solves the coherence problem by extending the coherence protocol. Our active memory controller leverages and extends the coherence mechanism, so that re-mapping techniques work transparently on both uniprocessor and multiprocessor systems.

We present a microarchitecture for an active memory controller with a programmable core and specialized hardware that accelerates cache line assembly and disassembly. We present detailed simulation results that show uniprocessor speedup from 1.3 to 7.6 on a range of applications and microbenchmarks. In addition to uniprocessor speedup, we show single-node multiprocessor speedup for parallel active memory applications and discuss how the same controller architecture supports coherent multi-node systems called active memory clusters.
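To make the aliasing problem concrete, the following is a minimal toy sketch and not the controller design described here. It assumes a hypothetical shadow address range (`SHADOW`) that exposes a transposed view of an N x N matrix, a write-through cache modeled as a dictionary, and a `coherent` flag standing in for a protocol extension that invalidates the aliased copy on a store. All names, sizes, and the invalidation policy are illustrative assumptions.

```python
# Toy model (illustrative assumptions only): one word per address, a
# write-through cache keyed by address, and a shadow range that aliases
# a transposed view of the same matrix data.
N = 4          # hypothetical matrix dimension
BASE = 0       # normal space: A[i][j] lives at BASE + i*N + j
SHADOW = 1000  # shadow space: transpose(A)[i][j] at SHADOW + i*N + j


def shadow_to_normal(addr):
    """Map a shadow (transposed) address to the normal address it aliases."""
    i, j = divmod(addr - SHADOW, N)
    return BASE + j * N + i


def normal_to_shadow(addr):
    """Map a normal address to the shadow address that aliases it."""
    i, j = divmod(addr - BASE, N)
    return SHADOW + j * N + i


class ToySystem:
    def __init__(self, coherent):
        self.memory = {BASE + k: k for k in range(N * N)}
        self.cache = {}
        self.coherent = coherent  # stand-in for an extended protocol

    def _resolve(self, addr):
        return shadow_to_normal(addr) if addr >= SHADOW else addr

    def load(self, addr):
        # Fill the cache on a miss; hits return the cached copy as-is.
        if addr not in self.cache:
            self.cache[addr] = self.memory[self._resolve(addr)]
        return self.cache[addr]

    def store(self, addr, value):
        phys = self._resolve(addr)
        self.cache[addr] = value
        self.memory[phys] = value  # write-through for simplicity
        if self.coherent:
            # Invalidate the cached copy held under the other address.
            alias = normal_to_shadow(phys) if addr < SHADOW else phys
            self.cache.pop(alias, None)


def demo(coherent):
    system = ToySystem(coherent)
    shadow_addr = SHADOW + 1 * N + 0   # transpose(A)[1][0] aliases A[0][1]
    system.load(shadow_addr)           # cache the word via the shadow address
    system.store(BASE + 0 * N + 1, 99) # update A[0][1] via the normal address
    return system.load(shadow_addr)    # read again via the shadow address


print(demo(coherent=False))  # stale value from the shadow copy
print(demo(coherent=True))   # updated value after invalidation
```

Without the invalidation step the second read returns the stale cached word even though there is only one processor, which is the uniprocessor coherence problem the abstract refers to. A flush-based scheme would instead clear the cache around every re-mapped access.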
