Paper information - Light NUCA: a proposal for bridging the inter-cache latency gap

To deal with the “memory wall” problem, microprocessors include large secondary on-chip caches. But as these caches grow, they create a new latency gap between themselves and the fast L1 caches (the inter-cache latency gap). Recently, Non-Uniform Cache Architectures (NUCAs) have been proposed to sustain the size growth of secondary caches, which is threatened by wire-delay problems. NUCAs are size-oriented, however, and were not conceived to close the inter-cache latency gap. To tackle this problem, we propose Light NUCAs (L-NUCAs), which leverage on-chip wire density to interconnect small tiles through specialized networks that convey packets with distributed and dynamic routing. Our design reduces the tile delay (cache access plus one-hop routing) to a single processor cycle and places cache lines at a finer granularity than conventional caches, reducing cache latency. Our evaluations show that, in general, an L-NUCA simultaneously improves performance, energy, and area when integrated into either conventional or D-NUCA hierarchies.
