Improving cache performance via active management

This dissertation analyzes a way to improve cache performance via active management of a target cache space. As microprocessor speeds continue to grow faster than memory subsystem speeds, minimizing the average data access time becomes increasingly important. Current data caches are often managed inefficiently, so a good management technique can reduce the average data access time. Cache management involves two main processes: block allocation decisions and block replacement decisions. Active block allocation can be performed most efficiently in multilateral caches (several distinct data stores with disjoint contents placed in parallel within L1), where blocks exhibiting particular characteristics can be placed in the appropriate store.
To aid in our evaluation of different active block management schemes, we have developed a multilateral cache simulator, mlcache, which provides a platform on which different cache schemes can easily be specified and which produces evaluation statistics that help explain their performance. Using mlcache, we have evaluated the performance of proposed multilateral cache schemes and derived new, better-performing schemes. Our results show that multilateral schemes outperform traditional caches of similar size and often rival the performance of traditional caches nearly twice as large. However, a large performance gap remains between the previously proposed implementable schemes and a multilateral configuration that uses a non-implementable, near-optimal replacement policy. This gap is due mainly to the simple prediction strategies presently used in the implementable schemes, along with their limited management of blocks while those blocks are resident in the L1 cache structure.
We introduce a new multilateral allocation scheme, Allocation By Conflict (ABC), which outperforms all previously proposed reuse-based multilateral configurations and performs comparably to multilateral schemes that require significantly more hardware (particularly Victim, which requires a data path between its A and B caches). The ABC scheme incurs the lowest hardware cost of any of the proposed multilateral schemes, yet it achieves the best performance among them and is the most easily implementable. It requires only a single additional bit per block in cache A and a very simple logic circuit for making the allocation decisions. The ABC scheme's performance advantage also scales well as the sizes of the caches are increased and as the associativity of the A cache is increased.
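
As an illustration of the multilateral organization described above (two stores with disjoint contents, probed in parallel, with an allocation decision on each miss), the following is a minimal sketch. It is not the mlcache simulator and does not implement ABC: the class names `LRUStore` and `MultilateralCache`, the `choose_store` callback, the fully associative LRU stores, and the toy address-threshold policy are all assumptions made for this example.

```python
from collections import OrderedDict


class LRUStore:
    """One data store. Assumption: fully associative with LRU replacement."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.blocks = OrderedDict()  # block number -> None, least recent first

    def lookup(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)  # mark as most recently used
            return True
        return False

    def insert(self, block):
        if len(self.blocks) >= self.num_blocks:
            self.blocks.popitem(last=False)  # evict the least recently used block
        self.blocks[block] = None


class MultilateralCache:
    """Two disjoint stores, A and B, sharing one allocation policy.

    choose_store is a hypothetical callback: it takes a block number and
    returns "A" or "B" to say where a missing block should be allocated.
    """

    def __init__(self, size_a, size_b, choose_store, block_size=32):
        self.a = LRUStore(size_a)
        self.b = LRUStore(size_b)
        self.choose_store = choose_store
        self.block_size = block_size
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.block_size
        # Hardware would probe both stores in parallel; because a block is
        # allocated to exactly one store, at most one of them can hit.
        if self.a.lookup(block) or self.b.lookup(block):
            self.hits += 1
            return True
        self.misses += 1
        target = self.a if self.choose_store(block) == "A" else self.b
        target.insert(block)
        return False

    def miss_rate(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


def run_trace(cache, block_numbers):
    """Replay a list of block numbers and return the cache for inspection."""
    for block in block_numbers:
        cache.access(block * cache.block_size)
    return cache


# Toy trace: blocks 0 and 1 are reused, blocks 100-102 are touched once.
TRACE = [0, 1, 100, 101, 102, 0, 1]

# Toy policy (an assumption, not ABC): send high block numbers to store B.
split = run_trace(MultilateralCache(2, 1, lambda b: "B" if b >= 100 else "A"), TRACE)
# Baseline: every block is allocated to store A, so B is never used.
all_a = run_trace(MultilateralCache(2, 1, lambda b: "A"), TRACE)
print(split.miss_rate(), all_a.miss_rate())
```

In this toy trace, steering the single-use blocks into the small B store keeps them from evicting the reused blocks in A, which is the kind of effect an allocation policy is meant to capture. A real scheme would have to make that decision from information available in hardware rather than from a fixed address threshold.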
