Exploiting Non-Uniform Access Time in Interconnect-Sensitive Cache Partitioning

Growing wire delay and rising clock rates limit the amount of cache that can be accessed within a single cycle [3,13]. Traditional cache architectures assume that each level in the cache hierarchy has a uniform access time. As microprocessor technology advances, architects must decide how best to use the increased resources while accounting for these growing wire delays and clock rates. Because on-chip communication is costly [14], accesses to different physical locations in the cache incur a range of hit latencies. This non-uniformity can be exploited to provide faster access to the cache blocks physically closest to the processing elements: as more cache is placed on chip, the access time of the closest cache bank becomes far smaller than that of the farthest bank. Previous research leveraged this non-uniformity by migrating the cache sets most likely to be used into the closer banks. This work focuses on the placement of the cache banks and on the interconnection topology that allows the banks to communicate with one another and with the processor core.

This research evaluates the performance gain of non-uniform cache architectures interconnected in a hypercube network, using a detailed cache model, an Alpha 21364 floorplan model, and an out-of-order processor simulator. The methodology generates candidate cache organizations and their timing for a range of cache requirements. Each organization is then manually laid out on the physical floorplan, where global wire lengths are extracted and modeled in HSpice to obtain the latency due to global wire delay. The resulting hit and miss access times, together with the global wire latencies, are fed into SimpleScalar [11] and simulated over the SPEC2000 benchmark suite [9]. Initial results compare an S-NUCA cache with a mesh network against D-NUCA caches with torus, mesh, and hypercube interconnection topologies, and demonstrate a 43% performance improvement.
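As a concrete illustration of the D-NUCA migration idea described above, the following is a minimal sketch of generational promotion: on a hit, a block is swapped one bank closer to the processor within its bank set, so frequently accessed data gravitates toward the fastest banks over time. The names and structure here (bank_set_t, BANKS_PER_SET, dnuca_access) are hypothetical, not taken from the simulator used in this work.

```c
#define BANKS_PER_SET 8  /* banks searched for one cache set, nearest first */

typedef struct {
    unsigned long tag[BANKS_PER_SET];   /* candidate block in each bank */
    int           valid[BANKS_PER_SET];
} bank_set_t;

/* Looks up `tag` across the bank set; returns the bank index of the hit
 * (0 = closest, fastest bank) or -1 on a miss. On a hit in bank b > 0,
 * the block is swapped with whatever occupies bank b-1 (promotion), so
 * the next access to it is one bank cheaper. */
int dnuca_access(bank_set_t *set, unsigned long tag)
{
    for (int b = 0; b < BANKS_PER_SET; b++) {
        if (set->valid[b] && set->tag[b] == tag) {
            if (b > 0) {
                unsigned long t = set->tag[b - 1];
                int           v = set->valid[b - 1];
                set->tag[b - 1]   = set->tag[b];
                set->valid[b - 1] = 1;
                set->tag[b]   = t;
                set->valid[b] = v;
            }
            return b;  /* latency is charged for the bank where the hit occurred */
        }
    }
    return -1;  /* miss: the fill policy (e.g., insert far, promote on reuse) is omitted */
}
```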
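The choice of interconnection topology determines how many router hops a request pays to reach each bank. As a rough, self-contained comparison (the 16-bank count, corner core placement, grid-to-hypercube node numbering, and unit-cost hops are illustrative assumptions, not the evaluated floorplan), the sketch below computes the average hop count from the bank nearest the core to every bank under mesh, torus, and hypercube routing:

```c
#include <stdio.h>

static int min(int a, int b) { return a < b ? a : b; }

/* Hamming distance from node 0, i.e. hop count in a hypercube. */
static int popcount(unsigned v) {
    int c = 0;
    for (; v; v >>= 1) c += v & 1;
    return c;
}

int main(void)
{
    const int N = 4;  /* assumed 4x4 grid of cache banks, core at (0,0) */
    double mesh = 0.0, torus = 0.0, cube = 0.0;

    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++) {
            mesh  += x + y;                           /* Manhattan distance */
            torus += min(x, N - x) + min(y, N - y);   /* wrap-around links  */
            cube  += popcount((unsigned)(y * N + x)); /* 4-D hypercube node */
        }

    printf("average hops: mesh %.2f  torus %.2f  hypercube %.2f\n",
           mesh / (N * N), torus / (N * N), cube / (N * N));
    return 0;
}
```

With these assumptions the mesh averages 3.0 hops while the torus and hypercube average 2.0; gaps of this kind are what the topology comparison in the results probes.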
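Finally, to show how an extracted global wire length turns into a per-bank latency, here is a first-order sketch using the 0.38·RC approximation for the 50% delay of a distributed RC line. The per-millimeter resistance and capacitance and the clock period are placeholder values chosen for illustration; the thesis obtains these latencies from HSpice simulation of the extracted wires rather than from a closed-form model.

```c
#include <math.h>
#include <stdio.h>

/* Convert a global wire length into whole clock cycles of delay. */
int wire_latency_cycles(double len_mm)
{
    const double r_mm  = 75.0;      /* ohms per mm (assumed)        */
    const double c_mm  = 0.20e-12;  /* farads per mm (assumed)      */
    const double t_clk = 0.25e-9;   /* 4 GHz clock period (assumed) */

    /* 0.38*R*C approximates the 50% delay of a distributed RC line. */
    double delay = 0.38 * (r_mm * len_mm) * (c_mm * len_mm);
    return (int)ceil(delay / t_clk);
}

int main(void)
{
    /* Longer wires to farther banks cost more cycles: the source of
     * the non-uniform hit latencies exploited by NUCA designs. */
    for (double l = 2.0; l <= 10.0; l += 2.0)
        printf("%4.1f mm -> %d cycle(s)\n", l, wire_latency_cycles(l));
    return 0;
}
```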