Exploiting Non-Uniform Access Time in Interconnect-Sensitive Cache Partitioning

Growing wire delay and rising clock rates limit the amount of cache that can be accessed within a single cycle [3,13]. Traditional cache architectures assume that each level in the cache hierarchy has a uniform access time. As microprocessor technology advances, architects must decide how best to use the increased resources while accounting for these growing wire delays and clock rates. Because on-chip communication is costly [14], accesses to different physical locations in the cache incur a range of hit latencies. This non-uniformity can be exploited to provide faster access to the cache blocks physically closest to the processing elements: as more cache is placed on chip, the access time of the closest cache bank becomes far smaller than that of the farthest bank. Previous research leveraged this non-uniformity by migrating the cache sets most likely to be used into the closer banks. This work focuses on the placement of the cache banks and on the interconnection topology that allows the banks to communicate with one another and with the processor core.

This research evaluates the performance gain of non-uniform cache architectures interconnected in a hypercube network, using a detailed cache model, an Alpha 21364 floorplan model, and an out-of-order processor simulator. The methodology generates candidate cache organizations and their timing for a range of cache requirements. Each organization is then manually laid out on the physical floorplan, where global wire lengths are extracted and modeled in HSpice to obtain the latency due to global wire delay. The resulting hit and miss access times, together with the global wire latencies, are fed into SimpleScalar [11] and simulated over the SPEC2000 benchmark suite [9]. Initial results compare an S-NUCA cache with a mesh network against D-NUCA caches with torus, mesh, and hypercube interconnection topologies, and demonstrate a 43% performance improvement.
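As a concrete illustration of the D-NUCA migration idea described above, the following is a minimal sketch of generational promotion: on a hit, a block is swapped one bank closer to the processor within its bank set, so frequently accessed data gravitates toward the fastest banks over time. The names and structure here (bank_set_t, BANKS_PER_SET, dnuca_access) are hypothetical, not taken from the simulator used in this work.

```c
#define BANKS_PER_SET 8  /* banks searched for one cache set, nearest first */

typedef struct {
    unsigned long tag[BANKS_PER_SET];   /* candidate block in each bank */
    int           valid[BANKS_PER_SET];
} bank_set_t;

/* Looks up `tag` across the bank set; returns the bank index of the hit
 * (0 = closest, fastest bank) or -1 on a miss. On a hit in bank b > 0,
 * the block is swapped with whatever occupies bank b-1 (promotion), so
 * the next access to it is one bank cheaper. */
int dnuca_access(bank_set_t *set, unsigned long tag)
{
    for (int b = 0; b < BANKS_PER_SET; b++) {
        if (set->valid[b] && set->tag[b] == tag) {
            if (b > 0) {
                unsigned long t = set->tag[b - 1];
                int           v = set->valid[b - 1];
                set->tag[b - 1]   = set->tag[b];
                set->valid[b - 1] = 1;
                set->tag[b]   = t;
                set->valid[b] = v;
            }
            return b;  /* latency is charged for the bank where the hit occurred */
        }
    }
    return -1;  /* miss: the fill policy (e.g., insert far, promote on reuse) is omitted */
}
```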
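The choice of interconnection topology determines how many router hops a request pays to reach each bank. As a rough, self-contained comparison (the 16-bank count, corner core placement, grid-to-hypercube node numbering, and unit-cost hops are illustrative assumptions, not the evaluated floorplan), the sketch below computes the average hop count from the bank nearest the core to every bank under mesh, torus, and hypercube routing:

```c
#include <stdio.h>

static int min(int a, int b) { return a < b ? a : b; }

/* Hamming distance from node 0, i.e. hop count in a hypercube. */
static int popcount(unsigned v) {
    int c = 0;
    for (; v; v >>= 1) c += v & 1;
    return c;
}

int main(void)
{
    const int N = 4;  /* assumed 4x4 grid of cache banks, core at (0,0) */
    double mesh = 0.0, torus = 0.0, cube = 0.0;

    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++) {
            mesh  += x + y;                           /* Manhattan distance */
            torus += min(x, N - x) + min(y, N - y);   /* wrap-around links  */
            cube  += popcount((unsigned)(y * N + x)); /* 4-D hypercube node */
        }

    printf("average hops: mesh %.2f  torus %.2f  hypercube %.2f\n",
           mesh / (N * N), torus / (N * N), cube / (N * N));
    return 0;
}
```

With these assumptions the mesh averages 3.0 hops while the torus and hypercube average 2.0; gaps of this kind are what the topology comparison in the results probes.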
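Finally, to show how an extracted global wire length turns into a per-bank latency, here is a first-order sketch using the 0.38·RC approximation for the 50% delay of a distributed RC line. The per-millimeter resistance and capacitance and the clock period are placeholder values chosen for illustration; the thesis obtains these latencies from HSpice simulation of the extracted wires rather than from a closed-form model.

```c
#include <math.h>
#include <stdio.h>

/* Convert a global wire length into whole clock cycles of delay. */
int wire_latency_cycles(double len_mm)
{
    const double r_mm  = 75.0;      /* ohms per mm (assumed)        */
    const double c_mm  = 0.20e-12;  /* farads per mm (assumed)      */
    const double t_clk = 0.25e-9;   /* 4 GHz clock period (assumed) */

    /* 0.38*R*C approximates the 50% delay of a distributed RC line. */
    double delay = 0.38 * (r_mm * len_mm) * (c_mm * len_mm);
    return (int)ceil(delay / t_clk);
}

int main(void)
{
    /* Longer wires to farther banks cost more cycles: the source of
     * the non-uniform hit latencies exploited by NUCA designs. */
    for (double l = 2.0; l <= 10.0; l += 2.0)
        printf("%4.1f mm -> %d cycle(s)\n", l, wire_latency_cycles(l));
    return 0;
}
```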