The Execution Migration Machine

On-chip interconnect power already accounts for a significant fraction of the power consumed by chip multiprocessors (CMPs), and as continued transistor scaling drives core counts ever higher, chip power will be increasingly dominated by the on-chip network. For massive multicores to be feasible, therefore, total on-chip data movement must be reduced substantially. Since most of this traffic brings data to the locus of computation, one solution is to let threads migrate efficiently across the chip and execute near the data they access. In this paper, we present a detailed implementation of hardware-level, instruction-granularity thread migration in a 110-core CMP. Implemented in 45 nm ASIC technology, the chip occupies 100 mm² and is currently in fabrication. A custom stack-based ISA enables partial context migration: in the absence of network congestion, end-to-end migration takes 4 cycles between neighboring cores with a minimal thread context and 33 cycles between the farthest cores with a maximal context. To supplement a remote-cache-access-based shared memory paradigm, each core learns a thread’s data access patterns and migrates threads automatically. Through RTL-level simulation, we show that migration can reduce on-chip data movement by up to 14× at a modest area cost of 23%.
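To make the automatic-migration decision concrete, the Python sketch below models one plausible form of such a learning mechanism: a per-core predictor that marks as migratory any instruction address that begins a long enough run of consecutive accesses homed at the same remote core, so that the thread migrates there the next time it reaches that instruction instead of issuing remote cache accesses. The threshold DEPTH, the table size, and the method names are illustrative assumptions for this sketch, not the chip's actual predictor or RTL.

# Hedged sketch: a simplified per-core migration predictor.
# Assumption: a thread that issues DEPTH or more consecutive accesses
# homed at the same remote core is better served by migrating there.

DEPTH = 3          # consecutive remote accesses before a PC is marked migratory (assumed)
TABLE_SIZE = 32    # entries in the hypothetical per-core predictor table

class MigrationPredictor:
    def __init__(self):
        self.migratory_pcs = []      # small table of learned "migratory" instruction addresses
        self.run_start_pc = None     # PC that began the current run of remote accesses
        self.run_home = None         # home core of the current run
        self.run_length = 0

    def should_migrate(self, pc):
        # Consulted before a memory access: migrate if this PC was learned earlier.
        return pc in self.migratory_pcs

    def observe(self, pc, home_core, local_core):
        # Called after each memory access, with the core that homes the data.
        if home_core == local_core:
            # Local access breaks the run.
            self.run_start_pc = None
            self.run_home = None
            self.run_length = 0
            return
        if self.run_start_pc is not None and home_core == self.run_home:
            self.run_length += 1
        else:
            # A new run of remote accesses starts at this instruction.
            self.run_start_pc = pc
            self.run_home = home_core
            self.run_length = 1
        if self.run_length >= DEPTH and self.run_start_pc not in self.migratory_pcs:
            if len(self.migratory_pcs) >= TABLE_SIZE:
                self.migratory_pcs.pop(0)   # simple FIFO eviction
            self.migratory_pcs.append(self.run_start_pc)

Keying the decision to the program counter is a natural choice because a given load or store tends to exhibit the same access locality each time it executes, so a pattern learned once can steer all later dynamic instances of that instruction.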
