BoPF: Mitigating the Burstiness-Fairness Tradeoff in Multi-Resource Clusters

Even though batch, interactive, and streaming applications all care about performance, their notions of performance differ. For instance, while the average completion time can sufficiently capture the performance of a throughput-sensitive batch-job queue (TQ) [5], interactive sessions and streaming applications form latency-sensitive queues (LQ): each LQ is a sequence of small jobs following an ON-OFF pattern. For these jobs [7], individual completion times or latencies are far more important than the average completion time or the throughput of the LQ. Indeed, existing "fair" schedulers are inherently unfair to LQ jobs: when LQ jobs are present (ON state), they must share the resources equally with TQ jobs, but when they are absent (OFF state), batch jobs get all the resources. In the long run, TQs receive more resources than their fair shares because today's schedulers, such as Dominant Resource Fairness [4], make instantaneous decisions. Clearly, it is impossible to achieve the best response time for LQ jobs under instantaneous fairness. In other words, there is a hard tradeoff between providing instantaneous fairness for TQs and minimizing the response time of LQs. However, instantaneous fairness is not necessary for TQs because the average completion time over a relatively long time horizon is their most important metric. This raises the following question: how well can we simultaneously accommodate multiple classes of workloads with performance guarantees, in particular, isolation protection for TQs in terms of long-term fairness and low response times for LQs? This work is our first step toward answering that question by designing BoPF: the first multi-resource scheduler that achieves both isolation protection for TQs and response time guarantees for LQs in a strategyproof way.
The key idea is "bounded" priority for LQs: as long as an LQ's burst is not so large that it hurts the long-term fair share of TQs and other LQs, the jobs in that burst are given higher priority so that they can be completed as quickly as possible.
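The intuition behind bounded priority can be illustrated with a small simulation. The following is a minimal sketch under simplifying assumptions, not the BoPF algorithm itself: it models a single resource, one always-backlogged TQ, and one LQ that submits a fixed-size burst every `period` steps. The function name `simulate`, its parameters, and the admission rule (a burst is prioritized only if it fits within the LQ's fair share over one period) are hypothetical choices made for illustration.

```python
from collections import deque


def simulate(period, burst, horizon, bounded_priority, capacity=1.0):
    """Toy single-resource model: one always-backlogged TQ, one ON-OFF LQ.

    Every `period` steps the LQ submits a burst needing `burst` resource-steps.
    Returns (list of LQ burst completion times, TQ share of all resources).
    """
    fair_share = capacity / 2  # two queues with equal weights
    # Assumed admission rule (illustrative only): the burst is "bounded" if it
    # fits within the LQ's long-term fair share over one period.
    admitted = bounded_priority and burst <= fair_share * period
    # Prioritized LQ may use the whole resource; otherwise only its fair share.
    cap = capacity if admitted else fair_share

    pending = deque()  # entries are [arrival_step, remaining_work]
    latencies, tq_total = [], 0.0
    for t in range(horizon):
        if t % period == 0:
            pending.append([t, burst])  # LQ enters the ON state
        budget, used = cap, 0.0
        while pending and budget > 1e-12:
            work = min(pending[0][1], budget)
            pending[0][1] -= work
            budget -= work
            used += work
            if pending[0][1] <= 1e-12:
                arrival, _ = pending.popleft()
                latencies.append(t + 1 - arrival)
        tq_total += capacity - used  # TQ absorbs whatever the LQ leaves
    return latencies, tq_total / (capacity * horizon)


if __name__ == "__main__":
    for policy in (False, True):
        lat, tq_share = simulate(period=10, burst=2.0, horizon=100,
                                 bounded_priority=policy)
        print("bounded priority" if policy else "instantaneous fairness",
              "| LQ latency per burst:", lat[0], "| TQ share:", tq_share)
```

In this toy setting the TQ ends up with the same long-run share of the resource (0.8) under both policies, while each LQ burst finishes in 2 steps instead of 4 when it is prioritized. A burst that exceeds the assumed bound (for example `burst=8.0`) is not prioritized and falls back to equal sharing, so the TQ still receives its fair share of 0.5.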

[1]  Ion Stoica,et al.  BlinkDB: queries with bounded errors and bounded response times on very large data , 2012, EuroSys '13.

[2]  Albert G. Greenberg,et al.  Scarlett: coping with skewed content popularity in mapreduce clusters , 2011, EuroSys '11.

[3]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[4]  Scott Shenker,et al.  Analysis and simulation of a fair queueing algorithm , 1989, SIGCOMM 1989.

[5]  Carlo Curino,et al.  Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters , 2015, USENIX Annual Technical Conference.

[6]  Johannes Josef Schneider,et al.  Stochastic optimization , 2006, Scientific computation.

[7]  Mung Chiang,et al.  Multiresource allocation: fairness-efficiency tradeoffs in a unifying framework , 2013, TNET.

[8]  Anne-Marie Kermarrec,et al.  Hawk: Hybrid Datacenter Scheduling , 2015, USENIX Annual Technical Conference.

[9]  Scott Shenker,et al.  Choosy: max-min fair sharing for datacenter jobs with constraints , 2013, EuroSys '13.

[10]  Hitesh Ballani,et al.  End-to-end Performance Isolation Through Virtual Datacenters , 2014, OSDI.

[11]  Min Zhu,et al.  B4: experience with a globally-deployed software defined wan , 2013, SIGCOMM.

[12]  Srikanth Kandula,et al.  Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters , 2016 .

[13]  H. Moulin Cooperative Microeconomics: A Game-Theoretic Introduction , 1995 .

[14]  Rene L. Cruz,et al.  A calculus for network delay, Part I: Network elements in isolation , 1991, IEEE Trans. Inf. Theory.

[15]  Archana Ganapathi,et al.  Analyzing Log Analysis: An Empirical Study of User Log Mining , 2014, LISA.

[16]  Amin Vahdat,et al.  BwE: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing , 2015, Comput. Commun. Rev..

[17]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[18]  Randy H. Katz,et al.  Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center , 2011, NSDI.

[19]  Srikanth Kandula,et al.  Graphene: Packing and Dependency-aware Scheduling for Data-parallel Clusters , 2016, OSDI.

[20]  David R. Cheriton,et al.  Borrowed-virtual-time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose scheduler , 1999, OPSR.

[21]  Srikanth Kandula,et al.  PACMan: Coordinated Memory Caching for Parallel Jobs , 2012, NSDI.

[22]  Patrick Jaillet,et al.  Online Optimization , 2011 .

[23]  Benjamin Hindman,et al.  Dominant Resource Fairness: Fair Allocation of Multiple Resource Types , 2011, NSDI.

[24]  Randy H. Katz,et al.  Improving MapReduce Performance in Heterogeneous Environments , 2008, OSDI.

[25]  M. Abadi,et al.  Naiad: a timely dataflow system , 2013, SOSP.

[26]  Wei Lin,et al.  Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing , 2014, OSDI.

[27]  Michael Abd-El-Malek,et al.  Omega: flexible, scalable schedulers for large compute clusters , 2013, EuroSys '13.

[28]  David E. Culler,et al.  Hierarchical scheduling for diverse datacenter workloads , 2013, SoCC.

[29]  Ion Stoica,et al.  Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics , 2016, NSDI.

[30]  Surajit Chaudhuri,et al.  A Statistical Approach Towards Robust Progress Estimation , 2011, Proc. VLDB Endow..

[31]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[32]  Minlan Yu,et al.  CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics , 2017, NSDI.

[33]  Jeffrey M. Jaffe,et al.  Bottleneck Flow Control , 1981, IEEE Trans. Commun..

[34]  Jingren Zhou,et al.  SCOPE: easy and efficient parallel processing of massive data sets , 2008, Proc. VLDB Endow..

[35]  Magdalena Balazinska,et al.  ParaTimer: a progress indicator for MapReduce DAGs , 2010, SIGMOD Conference.

[36]  Xiaobo Zhou,et al.  Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization , 2017, USENIX Annual Technical Conference.

[37]  Van Jacobson,et al.  Link-sharing and resource management models for packet networks , 1995, TNET.

[38]  Srikanth Kandula,et al.  Jockey: guaranteed job latency in data parallel clusters , 2012, EuroSys '12.

[39]  Leonard Kleinrock,et al.  Queueing Systems: Problems and Solutions , 1974 .

[40]  Ion Stoica,et al.  Efficient Coflow Scheduling Without Prior Knowledge , 2015, SIGCOMM.

[41]  Scott Shenker,et al.  Discretized streams: fault-tolerant streaming computation at scale , 2013, SOSP.

[42]  Carlo Curino,et al.  Morpheus: Towards Automated SLOs for Enterprise Clusters , 2016, OSDI.

[43]  Robert N. M. Watson,et al.  Queues Don't Matter When You Can JUMP Them! , 2015, NSDI.

[44]  Srikanth Kandula,et al.  Achieving high utilization with software-driven WAN , 2013, SIGCOMM.

[45]  Abhishek Verma,et al.  Large-scale cluster management at Google with Borg , 2015, EuroSys.

[46]  Andrew V. Goldberg,et al.  Quincy: fair scheduling for distributed computing clusters , 2009, SOSP '09.

[47]  Randy H. Katz,et al.  Selecting the best VM across multiple public clouds: a data-driven performance modeling approach , 2017, SoCC.

[48]  Zheng Wang,et al.  An Architecture for Differentiated Services , 1998, RFC.

[49]  Ion Stoica,et al.  A hierarchical fair service curve algorithm for link-sharing, real-time and priority services , 1997, SIGCOMM '97.

[50]  Justine Sherry,et al.  Silo: Predictable Message Completion Time in the Cloud , 2013 .

[51]  Daniel Mills,et al.  MillWheel: Fault-Tolerant Stream Processing at Internet Scale , 2013, Proc. VLDB Endow..

[52]  David L. Black,et al.  An Architecture for Differentiated Service , 1998 .

[53]  Scott Shenker,et al.  Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling , 2010, EuroSys '10.

[54]  Michael I. Jordan,et al.  Managing data transfers in computer clusters with orchestra , 2011, SIGCOMM.

[55]  Mor Harchol-Balter,et al.  Analysis of SRPT scheduling: investigating unfairness , 2001, SIGMETRICS '01.

[56]  Vyas Sekar,et al.  Multi-resource fair queueing for packet processing , 2012, CCRV.

[57]  Carlo Curino,et al.  Apache Hadoop YARN: yet another resource negotiator , 2013, SoCC.

[58]  Zhenhua Liu,et al.  HUG: Multi-Resource Fairness for Correlated and Elastic Demands , 2016, NSDI.

[59]  Surajit Chaudhuri,et al.  Estimating progress of execution for SQL queries , 2004, SIGMOD '04.

[60]  Srikanth Kandula,et al.  Reoptimizing Data Parallel Computing , 2012, NSDI.

[61]  Jeffrey F. Naughton,et al.  Toward a progress indicator for database queries , 2004, SIGMOD '04.

[62]  Christoforos E. Kozyrakis,et al.  Heracles: Improving resource efficiency at scale , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[63]  Rene L. Cruz,et al.  A calculus for network delay, Part II: Network analysis , 1991, IEEE Trans. Inf. Theory.

[64]  Joseph M. Hellerstein,et al.  GraphLab: A New Framework For Parallel Machine Learning , 2010, UAI.

[65]  Seif Haridi,et al.  Apache Flink™: Stream and Batch Processing in a Single Engine , 2015, IEEE Data Eng. Bull..

[66]  Yin Wang,et al.  Bistro: Scheduling Data-Parallel Jobs Against Live Production Systems , 2015, USENIX Annual Technical Conference.

[67]  Ravi Sethi,et al.  The Complexity of Flowshop and Jobshop Scheduling , 1976, Math. Oper. Res..

[68]  Aditya Akella,et al.  Altruistic Scheduling in Multi-Resource Clusters , 2016, OSDI.

[69]  Albert G. Greenberg,et al.  Reining in the Outliers in Map-Reduce Clusters using Mantri , 2010, OSDI.

[70]  Srikanth Kandula,et al.  Multi-resource packing for cluster schedulers , 2014, SIGCOMM.

[71]  Christina Delimitrou,et al.  PARTIES: QoS-Aware Resource Partitioning for Multiple Interactive Services , 2019, ASPLOS.