A novel migration-based NUCA design for Chip Multiprocessors

Chip Multiprocessors (CMPs) and Non-Uniform Cache Architectures (NUCAs) represent two emerging trends in computer architecture. Targeting future CMP-based systems with NUCA-type L2 caches, this paper proposes and evaluates a novel data migration algorithm for parallel applications. The goal of this migration scheme is to determine a suitable location for each data block within a large L2 space at any given point during execution. A unique characteristic of the proposed scheme is that it models the problem of optimal data placement in the L2 cache space as a two-dimensional post office placement problem. The paper presents a practical architectural implementation of this model and a detailed evaluation of that implementation. In our experimental evaluation, we also compare our approach to a previously proposed NUCA management scheme using applications from the SPEC OMP suite, OLTP, SPECjbb, and SPECweb. These experiments show that our migration approach improves average L2 access latency by about 35%, on average, over the previous migration scheme, and that these L2 latency savings translate, on average, to a 9.5% improvement in IPC (instructions per cycle). We also observed during our experiments that both the careful initial placement of data (which itself triggers migrations within the L2 space) and subsequent migrations (due to inter-processor data sharing) play an important role in achieving our performance improvements.
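
To make the post office analogy concrete, the following is a minimal sketch of the classical two-dimensional post office placement problem, not the paper's architectural implementation. It assumes that cores sit on a 2D grid, that access cost is proportional to Manhattan distance, and that each core's weight is its access count for a given block; the abstract states none of these details. Under those assumptions, the cost-minimizing location is the weighted median of the x coordinates paired with the weighted median of the y coordinates. The names `weighted_median`, `post_office_location`, and `total_cost` are hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (x, y) position on an assumed 2D grid


def weighted_median(points: List[Tuple[int, int]]) -> int:
    """Return the smallest position whose cumulative weight reaches half the total.

    `points` is a list of (position, weight) pairs with non-negative weights.
    """
    total = sum(weight for _, weight in points)
    if total <= 0:
        raise ValueError("at least one point must have a positive weight")
    running = 0
    for position, weight in sorted(points):
        running += weight
        if 2 * running >= total:
            return position
    raise AssertionError("unreachable: cumulative weight always reaches the total")


def post_office_location(access_counts: Dict[Coord, int]) -> Coord:
    """Pick the grid location minimizing the weighted Manhattan distance.

    Manhattan distance separates into an x term and a y term, so each
    coordinate can be solved independently with a weighted median.
    """
    xs = [(x, weight) for (x, _), weight in access_counts.items()]
    ys = [(y, weight) for (_, y), weight in access_counts.items()]
    return weighted_median(xs), weighted_median(ys)


def total_cost(location: Coord, access_counts: Dict[Coord, int]) -> int:
    """Sum of access count times Manhattan distance to `location`."""
    lx, ly = location
    return sum(
        weight * (abs(x - lx) + abs(y - ly))
        for (x, y), weight in access_counts.items()
    )


# Hypothetical access counts for one data block, keyed by core position.
accesses = {(0, 0): 10, (3, 0): 2, (3, 3): 1}
best = post_office_location(accesses)
```

In this hypothetical example the core at (0, 0) dominates the accesses, so the sketch places the block next to it. A block shared evenly by several cores would instead settle at a position between them, which is the intuition behind migrating data in response to inter-processor sharing.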
