Thermostat: Application-transparent Page Management for Two-tiered Main Memory

The advent of new memory technologies that are denser and cheaper than commodity DRAM has renewed interest in two-tiered main memory schemes. Infrequently accessed application data can be stored in such memories to achieve significant memory cost savings. Past research on two-tiered main memory has assumed a 4KB page size. However, 2MB huge pages are performance critical in cloud applications with large memory footprints, especially in virtualized cloud environments, where nested paging drastically increases the cost of 4KB page management. We present Thermostat, an application-transparent huge-page-aware mechanism to place pages in a dual-technology hybrid memory system while achieving both the cost advantages of two-tiered memory and the performance advantages of transparent huge pages. We present an online page classification mechanism that accurately classifies both 4KB and 2MB pages as hot or cold while incurring no observable performance overhead across several representative cloud applications. We implement Thermostat in Linux kernel version 4.5 and evaluate its effectiveness on representative cloud computing workloads running under KVM virtualization. We emulate slow memory with performance characteristics approximating near-future high-density memory technology and show that Thermostat migrates up to 50% of the application footprint to slow memory while limiting performance degradation to 3%, thereby reducing memory cost by up to 30%.
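The core idea of online hot/cold classification can be sketched in a few lines. The snippet below is an illustrative user-level model, not Thermostat's actual kernel implementation: it assumes a sampling-based scheme in which a fraction of pages is monitored per scan interval and observed access counts are scaled up to estimate each page's true access rate, which is then compared against a hotness threshold. The function name, the sampling fraction, and the threshold value are all hypothetical parameters chosen for illustration.

```python
def classify_pages(sampled_counts, sample_fraction, hot_threshold):
    """Split pages into hot (keep in fast memory) and cold (candidates
    for migration to slow memory) from sampled access counts.

    sampled_counts:  dict mapping page id -> accesses observed while the
                     page was being monitored during the scan interval
    sample_fraction: fraction of the interval each page was monitored
    hot_threshold:   estimated accesses per full interval above which a
                     page is considered hot
    """
    hot, cold = set(), set()
    for page, count in sampled_counts.items():
        # Scale the sampled count up to an estimate for the full interval.
        estimated = count / sample_fraction
        (hot if estimated > hot_threshold else cold).add(page)
    return hot, cold


# Example: four pages monitored 10% of the interval, threshold of 100
# estimated accesses per interval.
hot, cold = classify_pages({0: 42, 1: 3, 2: 0, 3: 15}, 0.1, 100)
# page 0 -> ~420 accesses (hot), page 3 -> ~150 (hot); pages 1 and 2 are cold
```

A real implementation would also have to avoid perturbing the monitored pages (e.g. by tracking accesses through the page-table hierarchy rather than instrumenting every load), which is what makes application-transparent classification at huge-page granularity nontrivial.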
