AMP: Adaptive Multi-stream Prefetching in a Shared Cache

Prefetching is a widely used technique in modern data storage systems. We study the most widely used class of prefetching algorithms, known as sequential prefetching. Two problems plague the state-of-the-art sequential prefetching algorithms: (i) cache pollution, which occurs when prefetched data replaces more useful prefetched or demand-paged data, and (ii) prefetch wastage, which happens when prefetched data is evicted from the cache before it can be used. A sequential prefetching algorithm can have a fixed or adaptive degree of prefetch, and can be either synchronous (it can prefetch only on a miss) or asynchronous (it can also prefetch on a hit). To capture these distinctions, we define four classes of prefetching algorithms: Fixed Synchronous (FS), Fixed Asynchronous (FA), Adaptive Synchronous (AS), and Adaptive Asynchronous (AA). We find that the relatively unexplored class of AA algorithms is in fact the most promising for sequential prefetching. We provide a first formal analysis of the criteria necessary for optimal throughput when using an AA algorithm in a cache shared by multiple steady sequential streams. We then provide a simple implementation, called AMP, which adapts accordingly, leading to near-optimal performance for any kind of sequential workload and cache size. Our experimental setup consisted of an IBM xSeries 345 dual-processor server running Linux with five SCSI disks. We observe that AMP convincingly outperforms all the contending members of the FA, FS, and AS classes for any number of streams and over all cache sizes. As anecdotal evidence, in an experiment with 100 concurrent sequential streams and varying cache sizes, AMP beats the FA, FS, and AS algorithms by 29-172%, 12-24%, and 21-210%, respectively, while outperforming OBL by a factor of 8. Even for complex workloads like SPC1-Read, AMP is consistently the best-performing algorithm. For the SPC2 Video-on-Demand workload, AMP can sustain at least 25% more streams than the next best algorithm. Finally, for a workload consisting of short sequences, where optimality is more elusive, AMP outperforms all the other contenders in overall performance.
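The four classes differ along exactly two axes: when a prefetch is triggered (on a miss only, or also on a hit) and whether the degree of prefetch can adapt. A minimal sketch of that taxonomy, assuming a hypothetical `SequentialPrefetcher` class and a toy doubling rule for adaptation (this is an illustration of the classification, not the paper's AMP algorithm):

```python
# Illustrative sketch of the FS/FA/AS/AA taxonomy described in the abstract.
# The class name and the doubling adaptation rule are assumptions made for
# illustration; AMP itself uses a more careful per-stream adaptation policy.

class SequentialPrefetcher:
    def __init__(self, adaptive: bool, asynchronous: bool,
                 degree: int = 2, max_degree: int = 32):
        self.adaptive = adaptive          # may the degree of prefetch change?
        self.asynchronous = asynchronous  # may we also prefetch on a hit?
        self.degree = degree              # pages prefetched per trigger
        self.max_degree = max_degree      # cap on the adapted degree

    def on_access(self, page: int, hit: bool) -> list[int]:
        """Return the pages to prefetch for one sequential access."""
        if hit and not self.asynchronous:
            return []  # synchronous classes (FS, AS) trigger only on a miss
        if self.adaptive and not hit:
            # Toy rule: a miss suggests we were not prefetching far enough
            # ahead, so grow the degree of prefetch (capped).
            self.degree = min(2 * self.degree, self.max_degree)
        return list(range(page + 1, page + 1 + self.degree))
```

An FS instance (`adaptive=False, asynchronous=False`) issues nothing on hits and a fixed two-page prefetch on each miss, while an AA instance prefetches on every access and grows its degree after misses.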
