Brief announcement: distributed shared memory based on computation migration

Driven by increasingly unbalanced technology scaling and power dissipation limits, microprocessor designers have resorted to increasing the number of cores on a single chip, and pundits expect 1000-core designs to materialize in the next few years [1]. But how will memory architectures scale, and how will these next-generation multicores be programmed? One barrier to scaling current memory architectures is the off-chip memory bandwidth wall [1,2]: off-chip bandwidth grows with package pin density, which scales much more slowly than on-die transistor density [3]. To reduce reliance on external memory and keep data on-chip, today's multicores integrate very large shared last-level caches [4]; the interconnects used with such shared caches, however, do not scale beyond relatively few cores, and the power requirements and access latencies of large caches rule them out for chips on a 1000-core scale. For massive-scale multicores, then, we are left with relatively small per-core caches.

Per-core caches at a 1000-core scale, in turn, raise the question of memory coherence. On the one hand, a shared memory abstraction is a practical necessity for general-purpose programming, and most programmers prefer a shared memory model [5]. On the other hand, ensuring coherence among private caches is an expensive proposition: bus-based and snoopy protocols do not scale beyond relatively few cores, and the directories required by directory-based protocols must hold entries for a significant fraction of the combined per-core cache capacity, or else directory evictions will limit performance [6]. Moreover, directory-based coherence protocols are notoriously difficult to implement and verify [7].
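
To make the directory-storage argument concrete, the following back-of-the-envelope sketch (in Python) estimates the size of a naive full-map directory at 1000 cores; the parameters (64 KB private caches, 64-byte lines, one sharer bit per core plus a few state bits) are illustrative assumptions rather than figures from this announcement.

    # Back-of-the-envelope estimate of full-map directory storage at 1000 cores.
    # All parameters are illustrative assumptions, not figures from this paper.

    CORES = 1000
    PER_CORE_CACHE_BYTES = 64 * 1024   # assumed 64 KB private cache per core
    LINE_BYTES = 64                    # assumed 64-byte cache lines

    # Lines the private caches can hold chip-wide; a directory sized to avoid
    # frequent evictions must track a comparable number of entries.
    tracked_lines = CORES * PER_CORE_CACHE_BYTES // LINE_BYTES

    # A full-map directory entry stores one sharer bit per core plus state bits.
    ENTRY_BITS = CORES + 8

    directory_bytes = tracked_lines * ENTRY_BITS // 8
    cache_bytes = CORES * PER_CORE_CACHE_BYTES

    print(f"lines tracked          : {tracked_lines:,}")
    print(f"full-map directory     : {directory_bytes / 2**20:.1f} MiB")
    print(f"combined private caches: {cache_bytes / 2**20:.1f} MiB")

Under these assumptions the sharer vectors alone occupy roughly twice the combined capacity of the private caches they track, illustrating why directory storage (or the evictions caused by undersizing it) becomes a first-order concern at this scale.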

References

[1] Omer Khan et al. System-level optimizations for memory access in the Execution Migration Machine (EM²), 2011.

[2] Srinivas Devadas et al. Deadlock-free fine-grained thread migration. Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip (NOCS), 2011.

[3] George Kurian et al. Graphite: a distributed parallel simulator for multicores. HPCA-16: The Sixteenth International Symposium on High-Performance Computer Architecture, 2010.

[4] Angela C. Sodan. Message-passing and shared-data programming models - wish vs. reality. 19th International Symposium on High Performance Computing Systems and Applications (HPCS'05), 2005.

[6] Omer Khan et al. EM²: A Scalable Shared-Memory Multicore Architecture, 2010.

[7] Philip J. Koopman, Jr. Stack Computers: The New Wave, 1989.

[8] Anoop Gupta et al. Operating system support for improving data locality on CC-NUMA compute servers. ASPLOS VII, 1996.

[9] David J. Lilja et al. So many states, so little time: verifying memory coherence in the Cray X1. Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2003.

[10] Stefan Rusu et al. A 45nm 8-core enterprise Xeon® processor, 2009.

[11] Anoop Gupta et al. The SPLASH-2 programs: characterization and methodological considerations. ISCA, 1995.

[12] Anoop Gupta et al. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. ICPP, 1990.

[13] Babak Falsafi et al. Reactive NUCA: near-optimal block placement and replication in distributed caches. ISCA, 2009.

[14] D. Banks et al. Assembly and Packaging, 2006.

[15] Marcelo Cintra et al. An OS-based alternative to full hardware coherence on tiled CMPs. IEEE 14th International Symposium on High Performance Computer Architecture (HPCA), 2008.