Towards Workload-Aware Page Cache Replacement Policies for Hybrid Memories

Die-stacked dynamic random-access memory (DRAM) is an emerging technology that is expected to be integrated with off-package memories in future systems, resulting in a hybrid memory system. A large body of recent research has investigated the use of die-stacked DRAM as a hardware-managed last-level cache. This approach comes at the cost of managing large tag arrays, increased hit latencies, and potentially significant increases in hardware verification costs. An alternative approach is for the operating system (OS) to manage the die-stacked DRAM as a page cache for off-package memories. However, recent work on OS-managed page caches focuses on FIFO replacement and related variants as the baseline management policy. In this paper, we take a step back and re-evaluate classical OS page replacement policies for hybrid memories. We find that, across different die-stacked DRAM sizes, the best management policy depends on both cache size and application, and the choice of policy can result in as much as a 13X performance difference. Furthermore, within a single application run, the best policy varies over time. We also evaluate co-scheduled workload pairs and find that the best policy varies by workload pair and cache configuration, and that the best-performing policy is typically also the most fair. These findings motivate our continued investigation into workload-aware and cache-configuration-aware page cache management policies.
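To make concrete what re-evaluating classical replacement policies involves, the following minimal sketch (not the paper's simulator; the trace shapes, the 1000-frame cache size, and the particular policy set are illustrative assumptions) measures the hit rates of FIFO, LRU, and random replacement over two synthetic page-access traces:

```python
# Illustrative sketch only: compares FIFO, LRU, and random page replacement on
# two synthetic access traces to show that the best-performing policy depends
# on the access pattern and on the cache size relative to the working set.
import random
from collections import OrderedDict, deque


def fifo_hit_rate(trace, frames):
    """Hit rate of a FIFO-managed page cache holding `frames` pages."""
    resident, order, hits = set(), deque(), 0
    for page in trace:
        if page in resident:
            hits += 1
            continue
        if len(resident) == frames:
            resident.discard(order.popleft())   # evict the oldest-loaded page
        resident.add(page)
        order.append(page)
    return hits / len(trace)


def lru_hit_rate(trace, frames):
    """Hit rate of an LRU-managed page cache holding `frames` pages."""
    resident, hits = OrderedDict(), 0
    for page in trace:
        if page in resident:
            hits += 1
            resident.move_to_end(page)          # refresh recency on a hit
            continue
        if len(resident) == frames:
            resident.popitem(last=False)        # evict the least recently used page
        resident[page] = None
    return hits / len(trace)


def random_hit_rate(trace, frames, rng):
    """Hit rate of a page cache that evicts a uniformly random resident page."""
    slots, slot_of, hits = [], {}, 0
    for page in trace:
        if page in slot_of:
            hits += 1
            continue
        if len(slots) == frames:
            victim = rng.randrange(frames)      # choose a random frame to evict
            del slot_of[slots[victim]]
            slots[victim] = page
            slot_of[page] = victim
        else:
            slot_of[page] = len(slots)
            slots.append(page)
    return hits / len(trace)


if __name__ == "__main__":
    rng = random.Random(0)
    frames = 1000
    # Looping scan slightly larger than the cache: recency-based policies thrash.
    looping = [i % 1100 for i in range(50000)]
    # Hot/cold mix: 90% of accesses touch 100 hot pages, the rest are scattered.
    hot_cold = [rng.randrange(100) if rng.random() < 0.9 else 100 + rng.randrange(9900)
                for _ in range(50000)]
    for name, trace in (("looping", looping), ("hot/cold", hot_cold)):
        print(f"{name:8s}  FIFO {fifo_hit_rate(trace, frames):5.2f}  "
              f"LRU {lru_hit_rate(trace, frames):5.2f}  "
              f"Random {random_hit_rate(trace, frames, rng):5.2f}")
```

On the looping scan that slightly exceeds the cache, FIFO and LRU both thrash while random replacement retains useful pages; on the hot/cold mix, LRU comes out ahead. This small experiment illustrates, under the stated assumptions, the broader observation above that no single policy wins across all workloads and cache configurations.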
