Optimal multistream sequential prefetching in a shared cache

Prefetching is a widely used technique in modern data storage systems. We study the most widely used class of prefetching algorithms known as sequential prefetching. There are two problems that plague the state-of-the-art sequential prefetching algorithms: (i) cache pollution, which occurs when prefetched data replaces more useful prefetched or demand-paged data, and (ii) prefetch wastage, which happens when prefetched data is evicted from the cache before it can be used. A sequential prefetching algorithm can have a fixed or adaptive degree of prefetch and can be either synchronous (when it can prefetch only on a miss) or asynchronous (when it can also prefetch on a hit). To capture these distinctions we define four classes of prefetching algorithms: fixed synchronous (FS), fixed asynchronous (FA), adaptive synchronous (AS), and adaptive asynchronous (AsynchA). We find that the relatively unexplored class of AsynchA algorithms is in fact the most promising for sequential prefetching. We provide a first formal analysis of the criteria necessary for optimal throughput when using an AsynchA algorithm in a cache shared by multiple steady sequential streams. We then provide a simple implementation called AMP (adaptive multistream prefetching) which adapts accordingly, leading to near-optimal performance for any kind of sequential workload and cache size. Our experimental setup consisted of an IBM xSeries 345 dual processor server running Linux using five SCSI disks. We observe that AMP convincingly outperforms all the contending members of the FA, FS, and AS classes for any number of streams and over all cache sizes. As anecdotal evidence, in an experiment with 100 concurrent sequential streams and varying cache sizes, AMP surpasses the FA, FS, and AS algorithms by 29--172%, 12--24%, and 21--210%, respectively, while outperforming OBL by a factor of 8. Even for complex workloads like SPC1-Read, AMP is consistently the best-performing algorithm. For the SPC2 video-on-demand workload, AMP can sustain at least 25% more streams than the next best algorithm. Furthermore, for a workload consisting of short sequences, where optimality is more elusive, AMP is able to outperform all the other contenders in overall performance. Finally, we implemented AMP in the state-of-the-art enterprise storage system, the IBM system storage DS8000 series. We demonstrated that AMP dramatically improves performance for common sequential and batch processing workloads and delivers up to a twofold increase in the sequential read capacity.

[1]  Pen-Chung Yew,et al.  : Data Prefetching In Shared Memory Multiprocessors , 1987, ICPP.

[2]  Darrell D. E. Long,et al.  Exploring the Bounds of Web Latency Reduction from Caching and Prefetching , 1997, USENIX Symposium on Internet Technologies and Systems.

[3]  James K. Archibald,et al.  Multiple Prefetch Adaptive Disk Caching , 1993, IEEE Trans. Knowl. Data Eng..

[4]  Michel Dubois,et al.  Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors , 1993, 1993 International Conference on Parallel Processing - ICPP'93.

[5]  Seh-Woong Jeong,et al.  Reducing cache pollution of prefetching in a small data cache , 2001, Proceedings 2001 IEEE International Conference on Computer Design: VLSI in Computers and Processors. ICCD 2001.

[6]  Jean-Loup Baer,et al.  Effective Hardware Based Data Prefetching for High-Performance Processors , 1995, IEEE Trans. Computers.

[7]  Alan Jay Smith,et al.  Cache Memories , 1982, CSUR.

[8]  Jim Griffioen,et al.  Reducing File System Latency using a Predictive Approach , 1994, USENIX Summer.

[9]  P. Krishnan,et al.  Practical prefetching via data compression , 1993 .

[10]  Dharmendra S. Modha,et al.  WOW: wise ordering for writes - combining spatial and temporal locality in non-volatile caches , 2005, FAST'05.

[11]  Per Stenström,et al.  Evaluation of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors , 1996, IEEE Trans. Parallel Distributed Syst..

[12]  Jesús Labarta,et al.  Linear aggressive prefetching: a way to increase the performance of cooperative caches , 1999, Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IPPS/SPDP 1999.

[13]  Anne Rogers,et al.  Software support for speculative loads , 1992, ASPLOS V.

[14]  Dharmendra S. Modha,et al.  SARC: Sequential Prefetching in Adaptive Replacement Cache , 2005, USENIX Annual Technical Conference, General Track.

[15]  K. Kavi Cache Memories Cache Memories in Uniprocessors. Reading versus Writing. Improving Performance , 2022 .

[16]  Hui Lei,et al.  An analytical approach to file prefetching , 1997 .

[17]  Anna R. Karlin,et al.  A study of integrated prefetching and caching strategies , 1995, SIGMETRICS '95/PERFORMANCE '95.

[18]  Dirk Grunwald,et al.  Prefetching Using Markov Predictors , 1999, IEEE Trans. Computers.

[19]  Ken Kennedy,et al.  Software prefetching , 1991, ASPLOS IV.

[20]  Anoop Gupta,et al.  Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors , 1991, J. Parallel Distributed Comput..

[21]  Jean-Loup Baer,et al.  Reducing memory latency via non-blocking and prefetching caches , 1992, ASPLOS V.

[22]  Anna R. Karlin,et al.  Near-Optimal Parallel Prefetching and Caching , 2000, SIAM J. Comput..

[23]  Anna R. Karlin,et al.  Near-optimal parallel prefetching and caching , 1996, Proceedings of 37th Conference on Foundations of Computer Science.

[24]  Peter J. Varman,et al.  PC-OPT: Optimal Offline Prefetching and Caching for Parallel I/O Systems , 2002, IEEE Trans. Computers.

[25]  S GillBinny,et al.  Optimal multistream sequential prefetching in a shared cache , 2007 .

[26]  Jim Zelenka,et al.  Informed prefetching and caching , 1995, SOSP.

[27]  Todd C. Mowry,et al.  Compiler-based prefetching for recursive data structures , 1996, ASPLOS VII.

[28]  Ibm Redbooks IBM System Storage Ds8000 Series: Architecture and Implementation , 2007 .

[29]  Bruce McNutt,et al.  A Standard Test of I/O Cache , 2001, Int. CMG Conference.

[30]  Mikko H. Lipasti,et al.  Cache miss heuristics and preloading techniques for general-purpose programs , 1995, MICRO 28.

[31]  Janak H. Patel,et al.  Data prefetching in multiprocessor vector cache memories , 1991, ISCA '91.

[32]  Mikko H. Lipasti,et al.  Partial resolution in branch target buffers , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.

[33]  P. Krishnan,et al.  Optimal prefetching via data compression , 1991, [1991] Proceedings 32nd Annual Symposium of Foundations of Computer Science.

[34]  Seung Ryoul Maeng,et al.  An adaptive sequential prefetching scheme in shared-memory multiprocessors , 1997, Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162).

[35]  Chris Metcalf,et al.  Data Prefetching: A Cost/Performance Analysis , 1993 .

[36]  Peter Triantafillou,et al.  Hierarchical caching and prefetching for continuous media servers with smart disks , 2000, IEEE Concurr..

[37]  Douglas J. Joseph,et al.  Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[38]  Brian N. Bershad,et al.  A trace-driven comparison of algorithms for parallel prefetching and caching , 1996, OSDI '96.

[39]  Stavros Harizopoulos,et al.  Hierarchical Caching and Prefetching for High Performance Continuous Media Servers with Smart Disks , 2000 .

[40]  Srinivas Devadas,et al.  Controlling Cache Pollution in Prefetching With Software-assisted Cache Replacement , 2005 .

[41]  Andreas Moshovos,et al.  Dependence based prefetching for linked data structures , 1998, ASPLOS VIII.

[42]  Alexander V. Veidenbaum,et al.  Compiler-directed data prefetching in multiprocessors with memory hierarchies , 1990, ICS '90.

[43]  Alexander V. Veidenbaum,et al.  Compiler-directed data prefetching in multiprocessors with memory hierarchies , 1990 .

[44]  Carla Schlatter Ellis,et al.  Practical prefetching techniques for parallel file systems , 1991, [1991] Proceedings of the First International Conference on Parallel and Distributed Information Systems.