Re-NUCA: Boosting CMP Performance Through Block Replication

Chip Multiprocessor (CMP) systems have become the reference architecture for designing micro-processors, thanks to the improvements in semiconductor nanotechnology that have continuously provided a crescent number of faster and smaller per-chip transistors. The interests for CMPs grew up since classical techniques for boosting performance, e.g. the increase of clock frequency and the amount of work performed at each clock cycle, can no longer deliver to significant improvement due to energy constrains and wire delay effects. CMP systems generally adopt a large last-level-cache (LLC) (typically, L2 or L3) shared among all cores, and private L1 caches. As the miss resolution time for private caches depends on the response time of the LLC, which is wire-delay dominated, performance are affected by wire delay. NUCA caches have been proposed for single and multi core systems as a mechanism for tolerating wire-delay effects on the overall performance. In this paper, we introduce a novel NUCA architecture, called Re-NUCA, specifically suited for (but not limited to) CMPs in which cores are placed at different sides of the shared cache. The idea is to allow shared blocks to be replicated inside the shared cache, in order to avoid the limitations to performance improvements that arise in classical D-NUCA caches due to the conflict hit problem. Our results show that Re-NUCA outperforms D-NUCA of more then 5% on average, but for those applications that strongly suffer from the conflict hit problem we observe performance improvements up to 15%.

[1]  Rohit Bhatia,et al.  Montecito: a dual-core, dual-thread Itanium processor , 2005, IEEE Micro.

[2]  Zeshan Chishti,et al.  Optimizing replication, communication, and capacity allocation in CMPs , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[3]  Jaehyuk Huh,et al.  A NUCA Substrate for Flexible CMP Cache Sharing , 2007, IEEE Transactions on Parallel and Distributed Systems.

[4]  Kunle Olukotun,et al.  A Single-Chip Multiprocessor , 1997, Computer.

[5]  Jung-Hsien Chiang,et al.  Neural and Fuzzy Methods in Handwriting Recognition , 1997, Computer.

[6]  Kunle Olukotun,et al.  The case for a single-chip multiprocessor , 1996, ASPLOS VII.

[7]  Uday Bondhugula,et al.  A Compile-Time Data Locality Optimization Framework for NUCA Chip Multiprocessors , 2008 .

[8]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[9]  Avi Mendelson,et al.  CMP Implementation in Systems Based on the Intel Core Duo Processor , 2006 .

[10]  Shyamkumar Thoziyoor,et al.  CACTI 5 . 1 , 2008 .

[11]  John L. Henning SPEC CPU2000: Measuring CPU Performance in the New Millennium , 2000, Computer.

[12]  Pierfrancesco Foglia,et al.  An Evaluation of Behaviors of S-NUCA CMPs Running Scientific Workload , 2009, 2009 12th Euromicro Conference on Digital System Design, Architectures, Methods and Tools.

[13]  David A. Wood,et al.  Managing Wire Delay in Large Chip-Multiprocessor Caches , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[14]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[15]  William J. Dally,et al.  Principles and Practices of Interconnection Networks , 2004 .

[16]  Valentin Puente,et al.  ESP-NUCA: A low-cost adaptive Non-Uniform Cache Architecture , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[17]  Kourosh Gharachorloo,et al.  Architecture and design of AlphaServer GS320 , 2000, SIGP.

[18]  Pierfrancesco Foglia,et al.  Analysis of Performance Dependencies in NUCA-Based CMP Systems , 2009, 2009 21st International Symposium on Computer Architecture and High Performance Computing.

[19]  Sudhakar Yalamanchili,et al.  Interconnection Networks: An Engineering Approach , 2002 .

[20]  Krste Asanovic,et al.  Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[21]  Yu (Kevin) Cao,et al.  What is Predictive Technology Model (PTM)? , 2009, SIGD.

[22]  Babak Falsafi,et al.  Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.

[23]  Alessandro Bardine,et al.  A power-efficient migration mechanism for D-NUCA caches , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[24]  Balaram Sinharoy,et al.  POWER5 system microarchitecture , 2005, IBM J. Res. Dev..

[25]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[26]  Ken Mai,et al.  The future of wires , 2001, Proc. IEEE.

[27]  David A. Wood,et al.  ASR: Adaptive Selective Replication for CMP Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[28]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[29]  Jichuan Chang,et al.  Cooperative Caching for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[30]  Pierfrancesco Foglia,et al.  Investigating Design Trade-Off in S-NUCA Based CMP Systems , 2009 .