DIRECTORYLESS SHARED MEMORY COHERENCE USING EXECUTION MIGRATION

We introduce the concept of deadlock-free migration-based coherent shared memory to the NUCA family of architectures. Migration-based architectures move threads among cores to guarantee sequential semantics in large multicores. Using a execution migration (EM) architecture, we achieve performance comparable to directory-based architectures without using directories: avoiding automatic data replication significantly reduces cache miss rates, while a fast network-level thread migration scheme takes advantage of shared data locality to reduce remote cache accesses that limit traditional NUCA performance. EM area and energy consumption are very competitive, and, on the average, it outperforms a directory-based MOESI baseline by 1.3 and a traditional S-NUCA design by 1.2 . We argue that with EM scaling performance has much lower cost and design complexity than in directorybased coherence and traditional NUCA architectures: by merely scaling network bandwidth from 256 to 512 bit flits, the performance of our architecture improves by an additional 13%, while the baselines show negligible improvement.

[1]  Anant Agarwal,et al.  Energy Scalability of On-Chip Interconnection Networks in Multicore Architectures , 2008 .

[2]  Robert Tappan Morris,et al.  Reinventing Scheduling for Multicore Systems , 2009, HotOS.

[3]  Mahmut T. Kandemir,et al.  A novel migration-based NUCA design for Chip Multiprocessors , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[4]  Srinivas Devadas,et al.  Deadlock-free fine-grained thread migration , 2011, Proceedings of the Fifth ACM/IEEE International Symposium.

[5]  Jung Ho Ahn,et al.  A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies , 2008, 2008 International Symposium on Computer Architecture.

[6]  George Kurian,et al.  Graphite: A distributed parallel simulator for multicores , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[7]  Aamer Jaleel,et al.  Analyzing Parallel Programs with PIN , 2010, Computer.

[8]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[9]  Mainak Chaudhuri PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[10]  Rajeev Balasubramonian,et al.  Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[11]  William J. Dally,et al.  Principles and Practices of Interconnection Networks , 2004 .

[12]  Krste Asanovic,et al.  Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[13]  Anoop Gupta,et al.  Operating system support for improving data locality on CC-NUMA compute servers , 1996, ASPLOS VII.

[14]  Niraj K. Jha,et al.  A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS , 2007, ICCD.

[15]  Wilson C. Hsieh,et al.  Computation migration: enhancing locality for distributed-memory parallel systems , 1993, PPOPP '93.

[16]  David E. Culler,et al.  Monsoon: an explicit token-store architecture , 1998, ISCA '98.

[17]  Marcelo Cintra,et al.  An OS-based alternative to full hardware coherence on tiled CMPs , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[18]  Vijayalakshmi Srinivasan,et al.  A Tagless Coherence Directory , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[19]  Omer Khan,et al.  System-level Optimizations for Memory Access in the Execution Migration Machine ( EM 2 ) , 2011 .

[20]  Michael D. Noakes,et al.  The J-machine multicomputer: an architectural evaluation , 1993, ISCA '93.

[21]  Pierre Michaud Exploiting the cache capacity of a single-chip multi-core processor with execution migration , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[22]  Sangyeun Cho,et al.  Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[23]  Anoop Gupta,et al.  Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes , 1990, ICPP.

[24]  Dean M. Tullsen,et al.  Proximity-aware directory-based coherence for multi-core processor architectures , 2007, SPAA '07.

[25]  John L. Hennessy,et al.  The performance advantages of integrating block data transfer in cache-coherent multiprocessors , 1994, ASPLOS VI.

[26]  Jean-Luc Gaudiot,et al.  Nomadic Threads: a migrating multithreaded approach to remote memory accesses in multiprocessors , 1996, Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique.

[27]  Ricardo Bianchini,et al.  Using simple page placement policies to reduce the cost of cache fills in coherent shared-memory systems , 1995, Proceedings of 9th International Parallel Processing Symposium.

[28]  Jean-Luc Gaudiot,et al.  An evaluation of thread migration for exploiting distributed array locality , 2002, Proceedings 16th Annual International Symposium on High Performance Computing Systems and Applications.

[29]  Babak Falsafi,et al.  Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.

[30]  D. Banks,et al.  Assembly and Packaging , 2006 .

[31]  George Kurian,et al.  ATAC: A 1000-core cache-coherent processor with on-chip optical network , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[32]  David W. Nellans,et al.  Micro-pages: increasing DRAM efficiency with locality-aware data placement , 2010, ASPLOS XV.

[33]  Coniferous softwood GENERAL TERMS , 2003 .

[34]  Sandhya Dwarkadas,et al.  SPACE: Sharing pattern-based directory coherence for multicore scalability , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[35]  Gu-Yeon Wei,et al.  Thread motion: fine-grained power management for multi-core systems , 2009, ISCA '09.

[36]  Richard J. Lipton,et al.  A Massive Memory Machine , 1984, IEEE Transactions on Computers.

[37]  Stefan Rusu,et al.  A 45nm 8-core enterprise Xeon ® processor , 2009 .

[38]  Koushik Chakraborty,et al.  Computation spreading: employing hardware migration to specialize CMP cores on-the-fly , 2006, ASPLOS XII.

[39]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[40]  A. Kumary,et al.  A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS , 2007 .