Swift: Reliable and Low-Latency Data Processing at Cloud Scale

Running large-scale applications on shared infrastructures such as data centers and clouds with low execution latency and high resource utilization is an increasingly pressing demand, yet remains challenging. This paper reports our experience with Swift, a system capable of efficiently running real-time and interactive data processing jobs at cloud scale. Taking the directed acyclic graph (DAG) as its job model, Swift achieves this design goal through three new mechanisms: 1) fine-grained scheduling, which efficiently partitions a job into graphlets (i.e., sub-graphs) based on new shuffle heuristics and schedules at the granularity of graphlets, thus avoiding resource fragmentation and waste; 2) adaptive memory-based in-network shuffling, which reduces I/O overhead and data transfer time by shuffling in memory and letting each job select the most efficient way to perform its shuffle; and 3) lightweight fault tolerance and recovery, which prolongs overall job execution time only slightly thanks to timely failure detection and fine-grained failure recovery. Experimental results show that Swift achieves an average speedup of 2.11× on TPC-H and 14.18× on Terasort compared with Spark. Swift has been deployed in production, supporting as many as 140,000 executors and processing millions of jobs per day. Experiments with production traces show that Swift outperforms JetScope and Bubble Execution by 2.44× and 1.23×, respectively.
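To make the notion of a graphlet concrete, the sketch below shows one simple way a stage DAG could be split at shuffle boundaries: stages connected only by pipeline (narrow) edges fall into the same graphlet, each shuffle edge starts a new one, and every graphlet is then scheduled as a single unit. This is a minimal illustration under our own assumptions; the class names, edge labels, and grouping rule here are hypothetical and do not reproduce Swift's actual shuffle heuristics.

```python
# Illustrative sketch (not Swift's algorithm): partition a stage DAG into
# "graphlets" by cutting at shuffle edges, then schedule each graphlet as a unit.
from collections import defaultdict

class StageDAG:
    def __init__(self):
        self.edges = defaultdict(list)   # stage -> [(child, edge_kind)]
        self.stages = set()

    def add_edge(self, parent, child, kind):
        """kind is 'pipeline' (narrow; stays inside a graphlet) or 'shuffle' (cut point)."""
        self.stages.update((parent, child))
        self.edges[parent].append((child, kind))

def partition_into_graphlets(dag):
    """Group stages into graphlets: connected components over pipeline edges only;
    every shuffle edge becomes a boundary between two graphlets."""
    # Build an undirected adjacency over pipeline edges only.
    adj = defaultdict(set)
    for parent, children in dag.edges.items():
        for child, kind in children:
            if kind == "pipeline":
                adj[parent].add(child)
                adj[child].add(parent)

    graphlets, seen = [], set()
    for stage in sorted(dag.stages):          # sorted only to make output deterministic
        if stage in seen:
            continue
        component, frontier = set(), [stage]  # flood-fill one pipeline-connected component
        while frontier:
            s = frontier.pop()
            if s in component:
                continue
            component.add(s)
            frontier.extend(adj[s] - component)
        seen |= component
        graphlets.append(component)
    return graphlets

# Toy job: map -(pipeline)-> filter -(shuffle)-> reduce -(pipeline)-> sink
dag = StageDAG()
dag.add_edge("map", "filter", "pipeline")
dag.add_edge("filter", "reduce", "shuffle")
dag.add_edge("reduce", "sink", "pipeline")
print(partition_into_graphlets(dag))
# -> two graphlets: {'map', 'filter'} and {'reduce', 'sink'} (set print order may vary);
#    each one would be requested and gang-scheduled as a single unit, not stage by stage.
```

Scheduling at graphlet granularity means resources are requested for one sub-graph at a time rather than for the whole job, which is the property that helps avoid the resource fragmentation and waste mentioned above.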

[1] Zheng Zhang, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015, ArXiv.

[2] John C. S. Lui, et al. G-thinker: A Distributed Framework for Mining Subgraphs in a Big Graph, 2020, 2020 IEEE 36th International Conference on Data Engineering (ICDE).

[3] Abhishek Verma, et al. Large-scale cluster management at Google with Borg, 2015, EuroSys.

[4] Dan Delorey, et al. Dremel: A Decade of Interactive SQL Analysis at Web Scale, 2020, Proc. VLDB Endow.

[5] Srikanth Kandula, et al. GRAPHENE: Packing and Dependency-aware Scheduling for Data-Parallel Clusters, 2016, OSDI.

[6] Aart J. C. Bik, et al. Pregel: a system for large-scale graph processing, 2010, SIGMOD Conference.

[7] Joseph M. Hellerstein, et al. MapReduce Online, 2010, NSDI.

[8] Srinivasan Parthasarathy, et al. Fractal: A General-Purpose Graph Pattern Mining System, 2019, SIGMOD Conference.

[9] Aditya Akella, et al. Altruistic Scheduling in Multi-Resource Clusters, 2016, OSDI.

[10] Andrew V. Goldberg, et al. Quincy: fair scheduling for distributed computing clusters, 2009, SOSP '09.

[11] Joseph Gonzalez, et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, 2012, OSDI.

[12] Scott Shenker, et al. Spark: Cluster Computing with Working Sets, 2010, HotCloud.

[13] Nikhil R. Devanur, et al. Bubble Execution: Resource-aware Reliable Analytics at Cloud Scale, 2018, Proc. VLDB Endow.

[14] Ian Rae, et al. F1: A Distributed SQL Database That Scales, 2013, Proc. VLDB Endow.

[15] Michael J. Freedman, et al. Riffle: optimized shuffle service for large-scale data analytics, 2018, EuroSys.

[16] Michael J. Franklin, et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, 2012, NSDI.

[17] Gerhard Weikum, et al. The LRU-K page replacement algorithm for database disk buffering, 1993, SIGMOD Conference.

[18] Leonardo Neumeyer, et al. S4: Distributed Stream Computing Platform, 2010, 2010 IEEE International Conference on Data Mining Workshops.

[19] Scott Shenker, et al. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling, 2010, EuroSys '10.

[20] Zhiwei Xu, et al. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems, 2011, 2011 IEEE 27th International Conference on Data Engineering.

[21] Panos Kalnis, et al. Lusail: A System for Querying Linked Data at Scale, 2017, Proc. VLDB Endow.

[22] Michael I. Jordan, et al. Ray: A Distributed Framework for Emerging AI Applications, 2017, OSDI.

[23] Christina Delimitrou, et al. Quasar: resource-efficient and QoS-aware cluster management, 2014, ASPLOS.

[24] Samer Al-Kiswany, et al. An Analysis of Network-Partitioning Failures in Cloud Systems, 2018, OSDI.

[25] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.

[26] Chuang Lin, et al. Modeling and understanding TCP incast in data center networks, 2011, 2011 Proceedings IEEE INFOCOM.

[27] Jignesh M. Patel, et al. Twitter Heron: Stream Processing at Scale, 2015, SIGMOD Conference.

[28] Bo Wang, et al. ActCap: Accelerating MapReduce on heterogeneous clusters with capability-aware data placement, 2015, 2015 IEEE Conference on Computer Communications (INFOCOM).

[29] Yu Liu, et al. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs, 2017, Proc. VLDB Endow.

[30] Benjamin Hindman, et al. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, 2011, NSDI.

[31] Patrick Wendell, et al. Sparrow: distributed, low latency scheduling, 2013, SOSP.

[32] Sanjay Ghemawat, et al. MapReduce: simplified data processing on large clusters, 2008, CACM.

[33] Wei Lin, et al. Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing, 2014, OSDI.

[34] Jie Xu, et al. Reliable Computing Service in Massive-Scale Systems through Rapid Low-Cost Failover, 2017, IEEE Transactions on Services Computing.

[35] Alexander Aiken, et al. A Distributed Multi-GPU System for Fast Graph Processing, 2017, Proc. VLDB Endow.

[36] Ricardo Bianchini, et al. History-Based Harvesting of Spare Cycles and Storage in Large-Scale Datacenters, 2016, OSDI.

[37] Carlo Curino, et al. Hydra: a federated resource manager for data-center scale analytics, 2019, NSDI.

[38] Randy H. Katz, et al. Improving MapReduce Performance in Heterogeneous Environments, 2008, OSDI.

[39] Yaoliang Yu, et al. Petuum: A New Platform for Distributed Machine Learning on Big Data, 2013, IEEE Transactions on Big Data.

[40] Xiaoyu Chen, et al. JetScope: Reliable and Interactive Analytics at Cloud Scale, 2015, Proc. VLDB Endow.

[41] Daniel Mills, et al. MillWheel: Fault-Tolerant Stream Processing at Internet Scale, 2013, Proc. VLDB Endow.

[42] Christina Delimitrou, et al. Tarcil: reconciling scheduling speed and quality in large shared clusters, 2015, SoCC.

[43] Navendu Jain, et al. Understanding network failures in data centers: measurement, analysis, and implications, 2011, SIGCOMM.

[44] Randy H. Katz, et al. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, 2011, NSDI.

[45] Carlo Curino, et al. Apache Hadoop YARN: yet another resource negotiator, 2013, SoCC.

[46] Christina Delimitrou, et al. Paragon: QoS-aware scheduling for heterogeneous datacenters, 2013, ASPLOS '13.

[47] John Allen, et al. Scuba: Diving into Data at Facebook, 2013, Proc. VLDB Endow.

[48] Chao Li, et al. Fuxi: a Fault-Tolerant Resource Management and Job Scheduling System at Internet Scale, 2014, Proc. VLDB Endow.

[49] Rob J Hyndman, et al. Sample Quantiles in Statistical Packages, 1996.

[50] Yuan Yu, et al. Dryad: distributed data-parallel programs from sequential building blocks, 2007, EuroSys '07.

[51] Scott Shenker, et al. Shark: SQL and rich analytics at scale, 2012, SIGMOD '13.

[52] Martin Grund, et al. Impala: A Modern, Open-Source SQL Engine for Hadoop, 2015, CIDR.

[53] Robert N. M. Watson, et al. Firmament: Fast, Centralized Cluster Scheduling at Scale, 2016, OSDI.