Improving Preemptive Scheduling with Application-Transparent Checkpointing in Shared Clusters

Modern data center clusters are shifting from dedicated, single-framework clusters to shared clusters. In such shared environments, cluster schedulers typically preempt jobs by simply killing them in order to enforce resource priority and fairness during peak utilization. This can cause significant resource waste and delay job response times. In this paper, we propose using suspend-resume mechanisms to mitigate the overhead of preemption in cluster scheduling. Instead of killing preempted jobs or tasks, our approach uses a system-level, application-transparent checkpointing mechanism to save the progress of jobs so they can resume later when resources become available. To reduce preemption overhead and improve job response times, our approach uses adaptive preemption to dynamically select the appropriate preemption mechanism (e.g., kill vs. suspend, local vs. remote restore) according to a task's progress and its suspend-resume overhead. By leveraging fast storage technologies, such as non-volatile memory (NVM), our approach can further reduce the preemption penalty to provide better QoS and resource efficiency. We implement the proposed approach and conduct extensive experiments via Google cluster trace-driven simulations and with applications on a Hadoop cluster. The results demonstrate that our approach significantly reduces resource and power usage and improves application performance over existing approaches. In particular, our implementation on the next-generation Hadoop YARN platform achieves up to a 67% reduction in resource wastage, a 30% improvement in overall job response time, and a 34% reduction in energy consumption over the current YARN scheduler.
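To make the adaptive selection concrete, the sketch below shows one way such a decision could be expressed: kill a task only when the work it would lose is cheaper than checkpointing and restoring it, and otherwise suspend it with a local or remote restore depending on whether its node is expected to free up. This is a minimal illustration under assumed cost estimates; the names (TaskInfo, PreemptionAction, the cost fields) are hypothetical and are not the paper's actual YARN implementation.

```java
// Minimal sketch of an adaptive preemption decision; all names and cost
// estimates here are illustrative assumptions, not the paper's API.
enum PreemptionAction { KILL, SUSPEND_LOCAL, SUSPEND_REMOTE }

final class TaskInfo {
    final long workCompletedMs;       // progress made so far (lost if killed)
    final long checkpointCostMs;      // estimated time to checkpoint task state
    final long localRestoreCostMs;    // estimated time to restore on the same node
    final long remoteRestoreCostMs;   // estimated time to restore on another node
    final boolean nodeLikelyFreeSoon; // will the local node free up before restart?

    TaskInfo(long workCompletedMs, long checkpointCostMs, long localRestoreCostMs,
             long remoteRestoreCostMs, boolean nodeLikelyFreeSoon) {
        this.workCompletedMs = workCompletedMs;
        this.checkpointCostMs = checkpointCostMs;
        this.localRestoreCostMs = localRestoreCostMs;
        this.remoteRestoreCostMs = remoteRestoreCostMs;
        this.nodeLikelyFreeSoon = nodeLikelyFreeSoon;
    }
}

final class AdaptivePreemptionPolicy {
    /**
     * Kill the task (and redo its work later) only if the lost work is cheaper
     * than the suspend-resume overhead; otherwise checkpoint it, restoring
     * locally when the node should free up soon and remotely when it should not.
     */
    PreemptionAction decide(TaskInfo t) {
        long suspendLocalCost = t.checkpointCostMs + t.localRestoreCostMs;
        long suspendRemoteCost = t.checkpointCostMs + t.remoteRestoreCostMs;
        long bestSuspendCost = t.nodeLikelyFreeSoon ? suspendLocalCost : suspendRemoteCost;

        // Killing wastes all completed work; suspending pays checkpoint + restore.
        if (t.workCompletedMs <= bestSuspendCost) {
            return PreemptionAction.KILL;
        }
        return t.nodeLikelyFreeSoon ? PreemptionAction.SUSPEND_LOCAL
                                    : PreemptionAction.SUSPEND_REMOTE;
    }
}
```

Under this kind of policy, a task that has just started is cheap to kill and re-run, while a long-running task with a small checkpoint footprint (especially on NVM-backed storage) is better suspended, which is the trade-off the abstract describes.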
