论文信息 - Reducing cache misses through programmable decoders

Reducing cache misses through programmable decoders

Level-one caches normally reside on a processor's critical path, which determines clock frequency. Therefore, fast access to level-one cache is important. Direct-mapped caches exhibit faster access time, but poor hit rates, compared with same sized set-associative caches because of nonuniform accesses to the cache sets. The nonuniform accesses generate more cache misses in some sets, while other sets are underutilized. We propose to increase the decoder length and, hence, reduce the accesses to heavily used sets without dynamically detecting the cache set usage information. We increase the access to the underutilized cache sets by incorporating a replacement policy into the cache design using programmable decoders. On average, the proposed techniques achieve as low a miss rate as a traditional 4-way cache on all 26 SPEC2K benchmarks for the instruction and data caches, respectively. This translates into an average IPC improvement of 21.5 and 42.4% for SPEC2K integer and floating-point benchmarks, respectively. The B-Cache consumes 10.5% more power per access, but exhibits a 12% total memory access-related energy savings as a result of the miss rate reductions, and, hence, the reduction to applications' execution time. Compared with previous techniques that aim at reducing the miss rate of direct-mapped caches, our technique requires only one cycle to access all cache hits and has the same access time of a direct-mapped cache.

Chuanjun Zhang

[1] André Seznec,et al. A case for two-way skewed-associative caches , 1993, ISCA '93.

[2] Vikas Agarwal,et al. Clock rate versus IPC: the end of the road for conventional microarchitectures , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[3] Norman P. Jouppi,et al. CACTI 2.0: An Integrated Cache Timing and Power Model , 2002 .

[4] Brad Calder,et al. Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[5] Norman P. Jouppi,et al. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[6] Jaejin Lee,et al. Using prime numbers for cache indexing to eliminate conflict misses , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[7] David Blaauw,et al. Drowsy caches: simple techniques for reducing leakage power , 2002, ISCA.

[8] Sally A. McKee,et al. Hitting the memory wall: implications of the obvious , 1995, CARN.

[9] J.J. Navarro,et al. The Difference-Bit Cache , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[10] Tony Givargis. Improved indexing for cache miss reduction in embedded systems , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[11] Brian N. Bershad,et al. Avoiding conflict misses dynamically in large direct-mapped caches , 1994, ASPLOS VI.

[12] Kaushik Roy,et al. Exploring high bandwidth pipelined cache architecture for scaled technology , 2003, 2003 Design, Automation and Test in Europe Conference and Exhibition.

[13] Anant Agarwal,et al. Column-associative caches: a technique for reducing the miss rate of direct-mapped caches , 1993, ISCA '93.

[14] ZhangChuanjun. Reducing cache misses through programmable decoders , 2008 .

[15] Frank Vahid,et al. A highly configurable cache architecture for embedded systems , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[16] Jih-Kwon Peir,et al. LRU-based column-associative caches , 1998, CARN.

[17] Margaret Martonosi,et al. Cache decay: exploiting generational behavior to reduce cache leakage power , 2001, ISCA 2001.

[18] Chenxi Zhang,et al. Two fast and high-associativity cache schemes , 1997, IEEE Micro.

[19] Maurice V. Wilkes,et al. The memory wall and the CMOS end-point , 1995, CARN.

[20] Frank Vahid,et al. A Way-Halting Cache for Low-Energy High-Performance Systems , 2005, IEEE Computer Architecture Letters.

[21] Richard E. Kessler,et al. Inexpensive Implementations Of Set-Associativity , 1989, The 16th Annual International Symposium on Computer Architecture.

[22] Kanad Ghose,et al. Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation , 1999, Proceedings. 1999 International Symposium on Low Power Electronics and Design (Cat. No.99TH8477).

[23] Jih-Kwon Peir,et al. Capturing dynamic memory reference behavior with adaptive cache topology , 1998, ASPLOS VIII.

[24] Chuanjun Zhang. Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches through Programmable Decoders , 2006 .

[25] Todd M. Austin,et al. The SimpleScalar tool set, version 2.0 , 1997, CARN.

[26] Lishing Liu. Cache designs with partial address matching , 1994, MICRO 27.

[27] A. Varadharajan,et al. A low-cost 300 MHz RISC CPU with attached media processor , 1998, 1998 IEEE International Solid-State Circuits Conference. Digest of Technical Papers, ISSCC. First Edition (Cat. No.98CH36156).

[28] Dirk Grunwald,et al. Predictive sequential associative cache , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[29] Samuel D. Naffziger,et al. The implementation of the Itanium 2 microprocessor , 2002, IEEE J. Solid State Circuits.

[30] Ronald Fagin,et al. Cold-start vs. warm-start miss ratios , 1978, CACM.

[31] Aristides Efthymiou,et al. An adaptive serial-parallel CAM architecture for low-power cache blocks , 2002, ISLPED '02.

[32] H. Shinohara,et al. A divided word-line structure in the static RAM and its application to a 64K full CMOS RAM , 1983, IEEE Journal of Solid-State Circuits.

[33] Chuanjun Zhang. Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[34] Honesty C. Young,et al. Improving cache performance with balanced tag and data paths , 1996, ASPLOS VII.