FlowTime: Dynamic Scheduling of Deadline-Aware Workflows and Ad-Hoc Jobs

With rapidly increasing volumes of data to be processed in modern data analytics, it is commonplace to run multiple data processing jobs with inter-job dependencies in a datacenter cluster, typically as recurring workloads. Such a group of inter-dependent data analytics jobs is referred to as a workflow, and it may have a deadline due to its mission-critical nature. In contrast, non-recurring ad-hoc jobs are typically best-effort: rather than meeting deadlines, the goal is to minimize their average job turnaround time. State-of-the-art scheduling mechanisms focus on meeting deadlines for individual jobs only and are oblivious to workflow deadlines. In this paper, we present FlowTime, a new system framework designed to make scheduling decisions for workflows so that their deadlines are met, while simultaneously optimizing the performance of ad-hoc jobs. To achieve this objective, we first adopt a divide-and-conquer strategy to transform the workflow scheduling problem into a deadline-aware job scheduling problem. We then design an efficient algorithm that schedules deadline-aware jobs and ad-hoc jobs together by solving the corresponding optimization problem directly with a linear program solver. Our experimental results show that, compared with state-of-the-art scheduling algorithms, FlowTime achieves the lowest deadline-miss rates for deadline-aware workflows and a 2 to 10 times shorter average turnaround time for ad-hoc jobs.
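To make the divide-and-conquer step concrete, the sketch below shows one plausible way to turn a single workflow deadline into per-job deadlines. It is an illustration only and not FlowTime's actual decomposition, which the abstract does not specify. The function name `decompose_deadline`, the fixed per-job duration estimates, and the example workflow are all assumptions. Jobs are ordered topologically with Kahn's algorithm, and each job's latest allowed finish time is then derived in a backward pass from the workflow deadline.

```python
from collections import deque


def decompose_deadline(durations, edges, workflow_deadline):
    """Derive a latest-finish deadline for every job in a workflow DAG.

    durations: dict mapping job name -> estimated run time (assumed known).
    edges: list of (upstream, downstream) dependency pairs.
    workflow_deadline: time by which every job in the workflow must finish.
    """
    successors = {job: [] for job in durations}
    indegree = {job: 0 for job in durations}
    for upstream, downstream in edges:
        successors[upstream].append(downstream)
        indegree[downstream] += 1

    # Kahn's algorithm: repeatedly emit jobs whose dependencies are all emitted.
    ready = deque(job for job in durations if indegree[job] == 0)
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for nxt in successors[job]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(durations):
        raise ValueError("workflow contains a dependency cycle")

    # Backward pass: a job must finish early enough for each successor
    # to run to completion before that successor's own deadline.
    deadlines = {}
    for job in reversed(order):
        if not successors[job]:
            deadlines[job] = workflow_deadline
        else:
            deadlines[job] = min(
                deadlines[nxt] - durations[nxt] for nxt in successors[job]
            )
    return deadlines


# Hypothetical four-job workflow with a deadline of 20 time units.
example_durations = {"extract": 2, "clean": 3, "join": 4, "report": 1}
example_edges = [
    ("extract", "clean"),
    ("extract", "join"),
    ("clean", "report"),
    ("join", "report"),
]
example_deadlines = decompose_deadline(example_durations, example_edges, 20)
```

Once every job carries its own deadline, the workflow no longer has to be treated as a unit, and the jobs can be passed to a deadline-aware job scheduler alongside ad-hoc jobs. This sketch assumes run-time estimates are exact and ignores resource contention, both of which a real scheduler would have to account for.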
