MemLiner: Lining up Tracing and Application for a Far-Memory-Friendly Runtime

Far-memory techniques, which let applications use remote memory, are increasingly appealing in modern data centers: they support applications’ large memory footprints and improve machines’ resource utilization. Unfortunately, most far-memory techniques focus on OS-level optimizations and are agnostic to the managed runtimes and garbage collection (GC) underneath applications written in high-level languages. Because the GC’s object-access patterns differ from the application’s, GC can severely interfere with existing far-memory techniques, breaking remote-memory prefetching algorithms and causing severe local-memory misses. We developed MemLiner, a runtime technique that improves the performance of far-memory systems by “lining up” memory accesses from the application and the GC so that they follow similar memory access paths, thereby (1) reducing the local-memory working set and (2) improving remote-memory prefetching through simplified memory access patterns. We implemented MemLiner in two widely used GCs in OpenJDK: G1 and Shenandoah. Our evaluation with a range of widely deployed cloud systems shows that MemLiner improves applications’ end-to-end performance by up to 2.5×.
