A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality

Current data cache organizations fail to deliver high performance in scalar processors for many vector applications. There are two main reasons for this loss of performance: the use of the same organization for caching both spatial and temporal locality, and the "eager" caching policy used by caches. The first issue has led to the well-known trade-off of designing caches with a line size of a few tens of bytes. However, for memory reference patterns with low spatial locality, such lines introduce significant pollution. On the other hand, when spatial locality is very high, larger lines could be more convenient. The eager caching policy refers to the fact that data that misses in the cache and is required by the processor is always cached (except for writes in a no-write-allocate cache). However, numerical applications commonly have large working sets (large vectors, larger than the cache size), which results in a sweep of the cache without any opportunity to exploit temporal locality. In addition, these working sets replace other data that may be required later. In this paper, a novel data cache organization, called the dual data cache, is presented. To our knowledge, this is the first time a cache with independent parts for managing spatial and temporal locality is proposed. In this way, both types of locality can be exploited more efficiently. In addition, the dual data cache implements a lazy caching policy, which tries not to cache anything until a benefit (in terms of spatial or temporal locality) can be predicted. For both purposes, the dual data cache makes use of a locality prediction table, which is a history table with information about the most recently executed load/store instructions. In addition, a simplified implementation of the dual data cache, called the selective cache, is also presented.
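
To make the idea of a history table driving a lazy caching decision more concrete, the following Python fragment is a minimal sketch only. The abstract does not specify the fields of the locality prediction table or its decision rules, so everything here is an assumption made for illustration: the class name `LocalityPredictionTable`, the per-instruction stride tracking, the `LINE_SIZE` value, and the three outcomes `"spatial"`, `"temporal"`, and `"bypass"`. It should not be read as the authors' design.

```python
from dataclasses import dataclass
from typing import Optional

LINE_SIZE = 32  # bytes; assumed value for illustration, not taken from the paper


@dataclass
class Entry:
    """History for one load/store instruction (hypothetical layout)."""
    last_addr: int
    stride: Optional[int] = None  # unknown until a second access is seen
    stable: bool = False          # True once the same stride repeats


class LocalityPredictionTable:
    """Toy history table indexed by the address (PC) of a load/store."""

    def __init__(self, num_entries: int = 64):
        self.num_entries = num_entries
        self.slots = {}  # slot index -> (pc tag, Entry)

    def access(self, pc: int, addr: int) -> str:
        """Record one access and return an assumed caching decision."""
        slot = pc % self.num_entries
        tagged = self.slots.get(slot)
        if tagged is None or tagged[0] != pc:
            # No history for this instruction yet: be lazy, do not cache.
            self.slots[slot] = (pc, Entry(last_addr=addr))
            return "bypass"

        entry = tagged[1]
        stride = addr - entry.last_addr
        entry.stable = (stride == entry.stride)
        entry.stride = stride
        entry.last_addr = addr

        if not entry.stable:
            return "bypass"    # no predictable benefit yet
        if stride == 0:
            return "temporal"  # same address reused by this instruction
        if abs(stride) < LINE_SIZE:
            return "spatial"   # neighbouring data within one line
        return "bypass"        # large stride: caching would mostly pollute
```

In this sketch, an instruction that walks a vector with an 8-byte stride is predicted to have spatial locality on its third access, while one that strides by 4096 bytes is never cached. A real design would also need to decide how the prediction selects between the two independent cache parts, which the abstract does not detail.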
