Neptune: Scheduling Suspendable Tasks for Unified Stream/Batch Applications

Distributed dataflow systems allow users to express a wide range of computations, including batch, streaming, and machine learning. A recent trend is to unify different computation types as part of a single stream/batch application that combines latency-sensitive ("stream") and latency-tolerant ("batch") jobs. This sharing of state and logic across jobs simplifies application development. Examples include machine learning applications that perform batch training and low-latency inference, and data analytics applications that include batch data transformations and low-latency querying. Existing execution engines, however, were not designed for unified stream/batch applications. As we show, they fail to schedule and execute them efficiently while respecting their diverse requirements. We present Neptune, an execution framework for stream/batch applications that dynamically prioritizes tasks to achieve low latency for stream jobs. Neptune employs coroutines as a lightweight mechanism for suspending tasks without losing task progress. It couples this fine-grained control over CPU resources with a locality- and memory-aware (LMA) scheduling policy to determine which tasks to suspend and when, thereby sharing executors among heterogeneous jobs. We evaluate our open-source Spark-based implementation of Neptune on a 75-node Azure cluster. Neptune achieves up to 3x lower end-to-end processing latencies for latency-sensitive jobs of a stream/batch application, while minimally impacting the throughput of batch jobs.
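
To make the task-suspension idea concrete, below is a minimal, self-contained sketch in Scala; the class and method names are hypothetical, not Neptune's actual API. It shows a batch task that checks a suspension request at safe points between records and parks without discarding partial state. Neptune instead realizes suspension with Scala coroutines, which yield at such points without hand-written flag checks, but the control flow is analogous.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.locks.LockSupport

// Hypothetical sketch of a suspendable batch task (not Neptune's API).
// Neptune obtains the same behavior with Scala coroutines, which avoid
// the explicit flag checks shown here.
final class SuspendableTask(partition: Array[Long]) extends Runnable {
  private val suspendRequested = new AtomicBoolean(false)
  @volatile private var worker: Thread = _

  // Task progress survives suspension: nothing is recomputed on resume.
  private var position = 0
  private var partialSum = 0L

  override def run(): Unit = {
    worker = Thread.currentThread()
    while (position < partition.length) {
      partialSum += partition(position) // one unit of batch work
      position += 1
      // Safe point: yield the core if the scheduler asked us to.
      if (suspendRequested.get()) LockSupport.park()
    }
    println(s"partition processed: sum = $partialSum")
  }

  // Called by the scheduler when a latency-sensitive (stream) task
  // needs this executor's core.
  def suspend(): Unit = suspendRequested.set(true)

  // Called once the stream task completes; the batch task continues
  // from `position` with all partial state intact.
  def resume(): Unit = {
    suspendRequested.set(false)
    val w = worker
    if (w ne null) LockSupport.unpark(w)
  }
}
```

In this sketch, the scheduler would call suspend() when a stream task arrives at the executor and resume() once it finishes; because position and partialSum persist across the park, the batch task loses no progress, which is the property Neptune's coroutine-based suspension provides.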
