Paper information - Speculative Slot Reservation: Enforcing Service Isolation for Dependent Data-Parallel Computations


Priority scheduling is a fundamental tool for providing service isolation among jobs in shared clusters. Ideally, the performance of a high-priority job should not be degraded by a job with a lower priority. However, we show in this paper that simply assigning a high priority provides no isolation for jobs with dependent computations. Even a job with the highest priority may give up compute slots to another job before proceeding to its downstream computation. The cause is the barrier between computations: the downstream computation cannot start until all the upstream tasks have completed. Such an interruption of execution inevitably results in a significant delay. In this paper, we propose speculative slot reservation, which judiciously reserves slots for downstream computations so as to retain service isolation for high-priority jobs. To mitigate the utilization loss due to slot reservation, we analyze the trade-off between utilization and isolation, and expose a tunable knob to navigate it. We also propose a complementary straggler mitigation strategy that uses the reserved slots to run extra copies of slow tasks. We have implemented speculative slot reservation in Spark. Evaluations based on both cluster deployment and trace-driven simulations show that our approach enforces strict service isolation for high-priority jobs without slowing down the other, lower-priority jobs.
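
The barrier effect described in the abstract can be illustrated with a toy discrete-event simulation. This is only a sketch under stated assumptions, not the paper's Spark implementation: the function name `simulate`, the task durations, and the slot count are hypothetical. Scheduling is assumed to be non-preemptive, the low-priority job is assumed to be always backlogged, and reservation is simplified to holding every idle slot while the high-priority job still has a downstream stage. The tunable knob and the straggler mitigation strategy are not modelled.

```python
import heapq


def simulate(num_slots, stages, low_task_len, reserve):
    """Return the completion time of the high-priority job (toy model).

    stages: list of stages, each a list of task durations. A stage may
        start only after every task of the previous stage has finished
        (the barrier).
    low_task_len: duration of each task of an always-backlogged
        low-priority job, or None if the high-priority job runs alone.
    reserve: if True, idle slots are held for the high-priority job
        while it still has a downstream stage (simplified reservation).
    """
    free = num_slots            # idle slots
    running = []                # min-heap of (finish_time, owner)
    stage = 0                   # index of the current stage
    pending = list(stages[0])   # tasks of the current stage not yet started
    in_flight = 0               # high-priority tasks currently running
    now = 0
    while True:
        # The high-priority job is always offered free slots first.
        while free and pending:
            heapq.heappush(running, (now + pending.pop(), "high"))
            free -= 1
            in_flight += 1
        # Leftover slots go to the low-priority job unless reserved.
        more_stages = stage + 1 < len(stages)
        if low_task_len is not None and not (reserve and more_stages):
            while free:
                heapq.heappush(running, (now + low_task_len, "low"))
                free -= 1
        # Advance time to the next task completion (non-preemptive).
        now, owner = heapq.heappop(running)
        free += 1
        if owner == "high":
            in_flight -= 1
            if in_flight == 0 and not pending:
                # Barrier cleared: the next stage becomes runnable.
                stage += 1
                if stage == len(stages):
                    return now
                pending = list(stages[stage])


# Hypothetical workload: stage 1 has one straggler (4 time units),
# stage 2 has four tasks of 2 units; low-priority tasks take 10 units.
stages = [[1, 1, 1, 4], [2, 2, 2, 2]]
print(simulate(4, stages, None, reserve=False))  # running alone: 6
print(simulate(4, stages, 10, reserve=False))    # priority only: 12
print(simulate(4, stages, 10, reserve=True))     # with reservation: 6
```

In this toy workload, the three slots released at time 1 are taken by long low-priority tasks while the straggler is still running, so the downstream stage is squeezed onto a single slot and the job finishes at 12 instead of 6. Holding the slots idle until the barrier clears restores the standalone completion time, at the cost of three slots sitting unused for three time units, which is the utilization loss the abstract refers to.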

Bo Li | Chen Chen | Wei Wang
