HK-NUCA: Boosting Data Searches in Dynamic Non-Uniform Cache Architectures for Chip Multiprocessors

The exponential increase in the cache sizes of multicore processors (CMPs) accompanied by growing on-chip wire delays make it difficult to implement traditional caches with single and uniform access latencies. Non-Uniform Cache Architecture (NUCA) designs have been proposed to address this problem. NUCA divides the whole cache memory into smaller banks and allows nearer cache banks to have lower access latencies than farther banks, thus mitigating the effects of the cache's internal wires. Traditionally, NUCA organizations have been classified as static (S-NUCA) and dynamic (D-NUCA). While in S-NUCA a data block is mapped to a unique bank in the NUCA cache, D-NUCA allows a data block to be mapped in multiple banks. Besides, D-NUCA designs are dynamic in the sense that data blocks may migrate towards the cores that access them most frequently. Recent works consider D-NUCA as a promising design, however, in order to obtain significant performance benefits, they used a non-affordable access scheme mechanism to find data in the NUCA cache. In this paper, we propose a novel and implementable data search algorithm for D-NUCA designs in CMP architectures, which is called HK-NUCA (\emph{Home Knows where to find data within the NUCA cache}). It exploits migration features by providing fast and power efficient accesses to data which is located close to the requesting core. Moreover, HK-NUCA implements an efficient and cost-effective search mechanism to reduce miss latency and on-chip network contention. We show that using HK-NUCA as data search mechanism in a D-NUCA design reduces about 40\% energy consumed per each memory request, and achieves an average performance improvement of 6%.

[1]  Mahmut T. Kandemir,et al.  Ring data location prediction scheme for Non-Uniform Cache Architectures , 2008, 2008 IEEE International Conference on Computer Design.

[2]  Rami G. Melhem,et al.  ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors , 2009, HiPEAC.

[3]  Valentin Puente,et al.  SP-NUCA: a cost effective dynamic non-uniform cache architecture , 2008, CARN.

[4]  Rami G. Melhem,et al.  Dynamic cache clustering for chip multiprocessors , 2009, ICS '09.

[5]  Babak Falsafi,et al.  Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.

[6]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[7]  Antonio González,et al.  Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs , 2009, Euro-Par.

[8]  Norman P. Jouppi,et al.  Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[9]  Zeshan Chishti,et al.  Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures , 2003, MICRO.

[10]  Sharad Malik,et al.  Orion: a power-performance simulator for interconnection networks , 2002, MICRO.

[11]  Jaehyuk Huh,et al.  A NUCA Substrate for Flexible CMP Cache Sharing , 2007, IEEE Transactions on Parallel and Distributed Systems.

[12]  Thomas F. Wenisch,et al.  SimFlex: Statistical Sampling of Computer System Simulation , 2006, IEEE Micro.

[13]  Gu-Yeon Wei,et al.  Process Variation Tolerant 3T1D-Based Cache Architectures , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[14]  Rajeev Balasubramonian,et al.  Interconnect design considerations for large NUCA caches , 2007, ISCA '07.

[15]  Mahmut T. Kandemir,et al.  A novel migration-based NUCA design for Chip Multiprocessors , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[16]  Alessandro Bardine,et al.  Analysis of static and dynamic energy consumption in NUCA caches: initial results , 2007, MEDEA '07.

[17]  Matthias S. Müller,et al.  Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[18]  David A. Wood,et al.  Managing Wire Delay in Large Chip-Multiprocessor Caches , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[19]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[20]  Mainak Chaudhuri PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[21]  Antonio González,et al.  LRU-PEA: A smart replacement policy for non-uniform cache architectures on chip multiprocessors , 2009, 2009 IEEE International Conference on Computer Design.

[22]  John Paul Shen,et al.  Best of both latency and throughput , 2004, IEEE International Conference on Computer Design: VLSI in Computers and Processors, 2004. ICCD 2004. Proceedings..

[23]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[24]  Antonio González,et al.  The auction: optimizing banks usage in Non-Uniform Cache Architectures , 2010, ICS '10.

[25]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.