An Evaluation of Behaviors of S-NUCA CMPs Running Scientific Workload

Modern systems can place two or more processors on the same die (Chip Multiprocessors, CMPs), each with its own private caches, while the last-level cache can be either private or shared. Because these systems suffer from wire delay, Non-Uniform Cache Architecture (NUCA) caches have been proposed to hide its effects and improve performance. A CMP system that adopts a NUCA as its shared last-level cache must maintain coherence among the private levels of the cache hierarchy. Since NUCA caches typically use a Network-on-Chip (NoC) as the communication infrastructure, in which the communication paradigm is message passing, the coherence protocol has to be directory based, similar to those proposed for classical DSM systems. Previous work on NUCA-based CMP systems assumes a fixed topology (i.e., the physical position of cores and NUCA banks, and the communication infrastructure), with each study adopting a different coherence strategy. In this paper, we present an evaluation of an 8-CPU CMP system with two levels of cache, in which the L1 caches are private to each core, while the L2 is a Static NUCA (S-NUCA) shared among all cores. We considered two system topologies (the first with all eight CPUs connected to the same side of the NUCA, the second with half of the CPUs on one side and the other half on the opposite side), and for each topology we considered both the MESI and MOESI protocols. The results indicate that processor topology affects performance and NoC bandwidth utilization much more than the coherence protocol does, a consequence of the data mapping and of the distribution of L2 accesses, which is not uniform across the cache banks.
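To make the interplay between static data mapping and topology concrete, the following is a minimal sketch rather than the simulator configuration used in the paper. The abstract does not state the line size, the bank grid, the address-to-bank function, or the exact CPU attachment points, so the 64-byte lines, the 8x4 bank grid, the low-order-bit interleaving, and the placement of CPUs along the edges are all assumptions chosen for illustration, as are the function names.

```python
from collections import Counter

LINE_BYTES = 64      # assumed cache-line size (not given in the abstract)
COLS, ROWS = 8, 4    # assumed bank grid: 8 columns x 4 rows

def bank_of(addr):
    """Static mapping (assumed): low-order line-address bits pick the bank."""
    index = (addr // LINE_BYTES) % (COLS * ROWS)
    return index % COLS, index // COLS   # (column, row)

def hops(cpu, bank, two_sided):
    """Manhattan distance from a CPU's attachment point to a bank."""
    col, row = bank
    if two_sided:
        # Assumed placement: CPUs 0-3 on the top edge, CPUs 4-7 on the
        # bottom edge, one CPU every two columns.
        cpu_col = (cpu % 4) * 2
        cpu_row = -1 if cpu < 4 else ROWS
    else:
        # Assumed placement: all eight CPUs on the top edge, one per column.
        cpu_col, cpu_row = cpu, -1
    return abs(cpu_col - col) + abs(cpu_row - row)

def bank_histogram(addresses):
    """Count how many accesses land on each bank under the static mapping."""
    return Counter(bank_of(a) for a in addresses)
```

Under these assumptions, an access pattern whose stride is a multiple of `LINE_BYTES * COLS * ROWS` falls entirely on one bank, and the number of hops a given CPU needs to reach that bank depends on which side of the NUCA it sits on. This is one simple way in which topology can matter even when the coherence protocol is unchanged.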
