Profiler and compiler assisted adaptive I/O prefetching for shared storage caches

I/O prefetching has long been used as one of the mechanisms to hide large disk latencies. In parallel applications, however, I/O prefetching becomes problematic when multiple CPUs share the same set of disks, because prefetches issued by different CPUs can interact in complex and unpredictable ways in the shared memory caches at the I/O nodes. In this paper, we (i) quantify the impact of compiler-directed I/O prefetching, originally developed in the context of sequential execution, on shared caches at I/O nodes; the experimental data collected show that while I/O prefetching brings benefits, its effectiveness drops significantly as the number of CPUs increases; (ii) identify inter-CPU misses caused by harmful prefetches as one of the main sources of this performance reduction as the number of CPUs grows; and (iii) propose and experimentally evaluate a profiler and compiler assisted adaptive I/O prefetching scheme targeting shared storage caches. The proposed scheme uses profiling to obtain inter-thread data sharing information and, based on the captured data sharing patterns, divides the threads into clusters and assigns a separate (customized) I/O prefetcher thread to each cluster. In our approach, the compiler generates the I/O prefetching threads automatically. We implemented this new I/O prefetching scheme using a compiler and the PVFS file system running on Linux, and the empirical data collected clearly underline the importance of adapting I/O prefetching based on program phases. Specifically, when 8 CPUs are used, our proposed scheme improves performance, on average, by 19.9%, 11.9%, and 10.3% over the cases with no I/O prefetching, with independent I/O prefetching (each CPU performs compiler-directed I/O prefetching independently), and with one-CPU prefetching (one CPU is reserved for prefetching on behalf of the others), respectively.
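
The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the clustering criterion, so everything here is an assumption made for illustration: the profile is modeled as a map from thread id to the set of disk blocks that thread touched, sharing is measured with Jaccard similarity, threads are merged when the similarity reaches a hypothetical `threshold`, and each cluster's prefetcher is represented only by the deduplicated list of blocks it would fetch. The names `jaccard`, `cluster_threads`, and `prefetch_plan` are hypothetical and do not come from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Fraction of blocks two threads have in common (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_threads(profile, threshold=0.5):
    """Group threads whose profiled block sets overlap by at least `threshold`.

    profile: dict mapping thread id -> set of block ids (assumed profile format).
    Returns a sorted list of sorted thread-id lists.
    """
    parent = {t: t for t in profile}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    # Merge every pair of threads that shares enough data.
    for a, b in combinations(sorted(profile), 2):
        if jaccard(profile[a], profile[b]) >= threshold:
            parent[find(b)] = find(a)

    clusters = {}
    for t in sorted(profile):
        clusters.setdefault(find(t), []).append(t)
    return sorted(clusters.values())


def prefetch_plan(profile, clusters):
    """One prefetch list per cluster: the union of its members' blocks.

    Shared blocks appear once, so a single prefetcher per cluster never
    issues the same block twice on behalf of different threads.
    """
    return [sorted(set().union(*(profile[t] for t in c))) for c in clusters]


# Hypothetical profile: threads 0 and 1 share blocks, as do threads 2 and 3.
profile = {
    0: {1, 2, 3, 4},
    1: {2, 3, 4, 5},
    2: {10, 11, 12},
    3: {11, 12, 13},
}
clusters = cluster_threads(profile)
plan = prefetch_plan(profile, clusters)
print(clusters)  # [[0, 1], [2, 3]]
print(plan)      # [[1, 2, 3, 4, 5], [10, 11, 12, 13]]
```

With one prefetch list per cluster, blocks shared by several threads are requested once rather than once per CPU, which is the kind of redundant or conflicting prefetch traffic the abstract attributes to independent prefetching. How the actual scheme forms clusters and adapts them across program phases is not stated in the abstract and is not modeled here.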
