Techniques utilizing memory reference characteristics for improved performance

This dissertation explores three aspects of reducing memory latency by exploiting characteristics of the second-level cache miss stream. Accessing data from main memory is two orders of magnitude slower than accessing it from a register within the processor. Thus, reducing main memory latency is paramount for continued improvement in overall processor performance. The prevailing solution is to use a cache. Most cache research to date has concentrated on simple cache geometries, assumed relatively small miss latencies, or used simple microarchitectures. Given current trends in computer architecture, techniques demonstrated in the past may no longer be as effective.
In the first part of the dissertation, I explore a mechanism for reducing the number of cache misses. Recognizing that there is opportunity to improve upon the traditional least recently used (LRU) replacement algorithm, I describe a new cache replacement mechanism, Reference Locality Replacement (RLR). RLR deviates from strict LRU replacement priorities by allowing older cache lines that are predicted to have temporal locality to remain in the cache. The ability of RLR to reduce cache misses is demonstrated with novel software-directed and hardware-directed replacement policies.
In the second part of the dissertation, I examine the capability of hardware prefetching techniques to hide the latency of cache misses. With an aggressive superscalar microarchitecture and contemporary main memory latencies, I demonstrate that prefetches need to be initiated more than one cache miss ahead in order to completely hide the memory latency. As a result, prefetching strategies that prefetch only the next cache miss will not scale well as the gap between processor and memory speeds continues to grow. I reconfirm the ability of stream buffers to prefetch effectively for scientific applications. In contrast, I show that Markov and linked data structure prefetchers are unable to prefetch effectively in general.
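The observation that prefetches must be issued more than one miss ahead can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the dissertation: it assumes a simplified model in which misses arrive at a fixed interval, and the function name and the cycle counts in the usage note are hypothetical.

```python
import math


def prefetch_distance(memory_latency_cycles, cycles_between_misses):
    """Return how many misses ahead a prefetch must be issued to hide latency.

    Simplified model (an assumption): misses are evenly spaced, so a prefetch
    issued k misses ahead has k * cycles_between_misses cycles to complete.
    The smallest k with k * cycles_between_misses >= memory_latency_cycles
    fully hides the latency.
    """
    if memory_latency_cycles <= 0 or cycles_between_misses <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(memory_latency_cycles / cycles_between_misses)
```

With a hypothetical 200-cycle memory latency and 60 cycles between misses, `prefetch_distance(200, 60)` gives 4, so a scheme that prefetches only the next miss would cover less than a third of the latency under this model.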
In the third part of the dissertation, I describe methods for reducing main memory latency by exploiting the structure of memory devices, which offers non-uniform access latencies. Using the device's large row buffer as a single-entry cache reduces the latency of memory reads by exploiting locality at a larger granularity. I demonstrate effective management of this faster access mode with two dynamic memory controllers that recognize the temporal and spatial locality in the cache miss stream.
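The idea of treating a row buffer as a single-entry cache can be sketched as follows. This is a minimal illustration of a fixed keep-the-row-open policy, not the dynamic controllers described in the dissertation; the class name, function name, and latency values are hypothetical assumptions.

```python
class OpenRowBank:
    """One memory bank whose row buffer acts as a single-entry cache."""

    def __init__(self, row_hit_latency=20, row_miss_latency=50):
        # Latencies are illustrative cycle counts, not measured values.
        self.row_hit_latency = row_hit_latency
        self.row_miss_latency = row_miss_latency
        self.open_row = None  # row currently held in the row buffer

    def read(self, row):
        """Return the latency of reading from the given row."""
        if row == self.open_row:
            return self.row_hit_latency  # served from the row buffer
        self.open_row = row  # replace the single cached row
        return self.row_miss_latency


def total_latency(rows, bank):
    """Sum the read latency over a sequence of row addresses."""
    return sum(bank.read(row) for row in rows)
```

For the row sequence `[3, 3, 3, 7, 7]` with the default latencies, only the first access to each row pays the 50-cycle cost, for a total of 160 cycles instead of 250 if every read missed the row buffer.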
