Judicious Thread Migration When Accessing Distributed Shared Caches

Chip-multiprocessors (CMPs) have become the mainstream chip design in recent years; for scalability reasons, designs with high core counts tend towards tiled CMPs with physically distributed shared caches. This naturally leads to a Non-Uniform Cache Architecture (NUCA) design, where onchip access latencies depend on the physical distances between requesting cores and home cores where the data is cached. Improving data locality is thus key to performance, and several studies have addressed this problem using data replication and data migration. In this paper, we consider another mechanism, hardwarelevel thread migration. This approach, we argue, can better exploit shared data locality for NUCA designs by effectively replacing multiple round-trip remote cache accesses with a smaller number of migrations. High migration costs, however, make it crucial to use thread migrations judiciously; we therefore propose a novel, on-line prediction scheme which decides whether to perform a remote access (as in traditional NUCA designs) or to perform a thread migration at the instruction level. For a set of parallel benchmarks, our thread migration predictor improves the performance by 18% on average and at best by 2.3X over the standard NUCA design that only uses remote accesses.

[1]  Rajeev Balasubramonian,et al.  Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[2]  Aamer Jaleel,et al.  Analyzing Parallel Programs with PIN , 2010, Computer.

[3]  Jaehyuk Huh,et al.  Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture , 2003, ISCA '03.

[4]  Richard J. Lipton,et al.  A Massive Memory Machine , 1984, IEEE Transactions on Computers.

[5]  Stefan Rusu,et al.  A 45nm 8-core enterprise Xeon ® processor , 2009 .

[6]  Shantanu Gupta,et al.  Architectural core salvaging in a multi-core processor for hard-error tolerance , 2009, ISCA '09.

[7]  Mainak Chaudhuri PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[8]  Henry Hoffmann,et al.  On-Chip Interconnection Architecture of the Tile Processor , 2007, IEEE Micro.

[9]  Robert Tappan Morris,et al.  Reinventing Scheduling for Multicore Systems , 2009, HotOS.

[10]  S. Borkar,et al.  An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS , 2008, IEEE Journal of Solid-State Circuits.

[11]  Wilson C. Hsieh,et al.  Computation migration: enhancing locality for distributed-memory parallel systems , 1993, PPOPP '93.

[12]  Zeshan Chishti,et al.  Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures , 2003, MICRO.

[13]  Srinivas Devadas,et al.  Brief announcement: distributed shared memory based on computation migration , 2011, SPAA '11.

[14]  Pierre Michaud Exploiting the cache capacity of a single-chip multi-core processor with execution migration , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[15]  Sangyeun Cho,et al.  Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[16]  Jaehyuk Huh,et al.  A NUCA Substrate for Flexible CMP Cache Sharing , 2007, IEEE Transactions on Parallel and Distributed Systems.

[17]  Koushik Chakraborty,et al.  Computation spreading: employing hardware migration to specialize CMP cores on-the-fly , 2006, ASPLOS XII.

[18]  Babak Falsafi,et al.  Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.

[19]  Srinivas Devadas,et al.  DIRECTORYLESS SHARED MEMORY COHERENCE USING EXECUTION MIGRATION , 2011 .

[20]  David A. Wood,et al.  Managing Wire Delay in Large Chip-Multiprocessor Caches , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[21]  David Wentzlaff,et al.  Processor: A 64-Core SoC with Mesh Interconnect , 2010 .

[22]  Srinivas Devadas,et al.  Deadlock-free fine-grained thread migration , 2011, Proceedings of the Fifth ACM/IEEE International Symposium.

[23]  George Kurian,et al.  Graphite: A distributed parallel simulator for multicores , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[24]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[25]  Omer Khan,et al.  System-level Optimizations for Memory Access in the Execution Migration Machine ( EM 2 ) , 2011 .

[26]  Krste Asanovic,et al.  Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[27]  Anoop Gupta,et al.  Operating system support for improving data locality on CC-NUMA compute servers , 1996, ASPLOS VII.

[28]  Vivek Sarkar,et al.  Baring It All to Software: Raw Machines , 1997, Computer.

[29]  Coniferous softwood GENERAL TERMS , 2003 .

[30]  Gu-Yeon Wei,et al.  Thread motion: fine-grained power management for multi-core systems , 2009, ISCA '09.

[31]  David W. Nellans,et al.  Micro-pages: increasing DRAM efficiency with locality-aware data placement , 2010, ASPLOS XV.

[32]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.