Analysis of Performance Dependencies in NUCA-Based CMP Systems

Improvements in semiconductor nanotechnology have continuously provided a growing number of faster and smaller transistors per chip. However, classical techniques for boosting performance, such as increasing the clock frequency and the amount of work performed in each clock cycle, can no longer deliver significant improvements because of energy constraints and wire-delay effects. As a consequence, designers' interest has shifted toward systems with multiple cores per chip (Chip Multiprocessors, CMPs). CMP systems typically adopt private L1 caches and a large last-level cache (LLC) shared among all cores. Because the miss resolution time of the private caches depends on the response time of the LLC, which is wire-delay dominated, overall performance is affected by wire delay. Non-Uniform Cache Architecture (NUCA) caches have been proposed for single-core and multi-core systems as a mechanism for tolerating the effects of wire delay on overall performance. In this paper, we introduce our design for S-NUCA and D-NUCA cache memory systems, and we present an analysis of an 8-CPU CMP system with two levels of cache, in which the L1 caches are private and the L2 is a NUCA cache shared among all cores. We consider two system topologies: in the first (8p), all eight CPUs are connected to the NUCA on the same side; in the second (4+4p), half of the CPUs are on one side and the other half are on the opposite side. For each configuration, we evaluate the effectiveness of both the static and the dynamic policies that have been proposed. Our results show that a D-NUCA scheme with the 8p topology is the best-performing solution among all the configurations considered, and that with the 4+4p topology the D-NUCA outperforms the S-NUCA in most cases. We highlight that performance is tied to both the mapping strategy (static or dynamic) and the topology. We also observe that bandwidth occupancy depends on both the NUCA policy and the topology.
