OS-Augmented Oversubscription of Opportunistic Memory with a User-Assisted OOM Killer

Exploiting opportunistic memory through oversubscription is an appealing approach to improving cluster utilization and throughput. In this paper, we find that the efficacy of memory oversubscription depends on whether the oversubscribed tasks can be killed by an Out-Of-Memory (OOM) killer in a timely manner, so as to avoid significant memory thrashing under memory pressure. However, current approaches in modern cluster schedulers cannot unleash the power of opportunistic memory because their user-space OOM killers cannot deliver a kill signal in time to terminate the oversubscribed tasks. Our experiments show that a user-space OOM killer fails to do so because it lacks knowledge of memory pressure from the OS, while the kernel-space Linux OOM killer is too conservative to relieve memory pressure. We therefore design a user-assisted OOM killer (UA killer) in kernel space, an OS augmentation for accurate thrashing detection and agile task killing. To identify a thrashing task, UA killer features a novel mechanism, constraint thrashing. On top of UA killer, we develop Charon, a cluster scheduler that oversubscribes opportunistic memory in an on-demand manner. We implement Charon on top of Mercury, a state-of-the-art opportunistic cluster scheduler. Extensive experiments with a Google trace on a 26-node cluster show that Charon can: (1) achieve agile task killing, (2) improve best-effort job throughput by 3.5X over Mercury while prioritizing production jobs, and (3) improve the 90th-percentile job completion time of production jobs by 62% over the Kubernetes opportunistic scheduler.
