TurboStream: Towards Low-Latency Data Stream Processing

Data Stream Processing (DSP) applications are often modelled as a directed acyclic graph: operators with data streams among them. Inter-operator communications can have a significant impact on the latency of DSP applications, accounting for 86% of the total latency. Despite their impact, there has been relatively little work on optimizing inter-operator communications, focusing on reducing inter-node traffic but not considering inter-process communication (IPC) inside a node, which often generates high latency due to the multiple memory-copy operations. This paper describes the design and implementation of TurboStream, a new DSP system designed specifically to address the high latency caused by inter-operator communications. To achieve this goal, we introduce (1) an improved IPC framework with OSRBuffer, a DSP-oriented buffer, to reduce memory-copy operations and waiting time of each single message when transmitting messages between the operators inside one node, and (2) a coarse-grained scheduler that consolidates operator instances and assigns them to nodes to diminish the inter-node IPC traffic. Using a prototype implementation, we show that our improved IPC framework reduces the end-to-end latency of intra-node IPC by 45.64% to 99.30%. Moreover, TurboStream reduces the latency of DSP by 83.23% compared to JStorm.

[1]  Margo I. Seltzer,et al.  Network-Aware Operator Placement for Stream-Processing Systems , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[2]  Dhabaleswar K. Panda,et al.  High performance RDMA-based design of HDFS over InfiniBand , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[3]  Seif Haridi,et al.  Apache Flink™: Stream and Batch Processing in a Single Engine , 2015, IEEE Data Eng. Bull..

[4]  Jignesh M. Patel,et al.  Storm@twitter , 2014, SIGMOD Conference.

[5]  Richard T. B. Ma,et al.  Smooth Task Migration in Apache Storm , 2015, SIGMOD Conference.

[6]  Shrideep Pallickara,et al.  Online Scheduling and Interference Alleviation for Low-Latency, High-Throughput Processing of Data Streams , 2017, IEEE Transactions on Parallel and Distributed Systems.

[7]  Guillaume Mercier,et al.  Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis , 2009, 2009 International Conference on Parallel Processing.

[8]  Bingsheng He,et al.  AdaStorm: Resource Efficient Storm with Adaptive Configuration , 2017, 2017 IEEE 33rd International Conference on Data Engineering (ICDE).

[9]  Cong Xu,et al.  JVM-Bypass for Efficient Hadoop Shuffling , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[10]  Mohammad Hosseini,et al.  R-Storm: Resource-Aware Scheduling in Storm , 2015, Middleware.

[11]  Jian Tang,et al.  T-Storm: Traffic-Aware Online Scheduling in Storm , 2014, 2014 IEEE 34th International Conference on Distributed Computing Systems.

[12]  Kun-Lung Wu,et al.  SODA: An Optimizing Scheduler for Large-Scale Stream-Based Distributed Computer Systems , 2008, Middleware.

[13]  Jignesh M. Patel,et al.  Twitter Heron: Stream Processing at Scale , 2015, SIGMOD Conference.

[14]  Thomas S. Heinze,et al.  Latency-aware elastic scaling for distributed data stream processing systems , 2014, DEBS '14.

[15]  Sayantan Sur,et al.  LiMIC: support for high-performance MPI intra-node communication on Linux cluster , 2005, 2005 International Conference on Parallel Processing (ICPP'05).

[16]  Anthony Skjellum,et al.  A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard , 1996, Parallel Comput..

[17]  George Bosilca,et al.  Locality and Topology Aware Intra-node Communication among Multicore CPUs , 2010, EuroMPI.

[18]  John D. Valois Lock-free linked lists using compare-and-swap , 1995, PODC '95.

[19]  Raul Castro Fernandez,et al.  Making State Explicit for Imperative Big Data Processing , 2014, USENIX Annual Technical Conference.

[20]  Guillaume Mercier,et al.  Design and evaluation of Nemesis, a scalable, low-latency, message-passing communication subsystem , 2006, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).

[21]  Raul Castro Fernandez,et al.  Integrating scale out and fault tolerance in stream processing using operator state management , 2013, SIGMOD '13.

[22]  Roberto Baldoni,et al.  Adaptive online scheduling in storm , 2013, DEBS.

[23]  Kun-Lung Wu,et al.  Elastic scaling of data parallel operators in stream processing , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[24]  Hai Jin,et al.  Runtime‐aware adaptive scheduling in stream processing , 2016, Concurr. Comput. Pract. Exp..