Whale: Efficient One-to-Many Data Partitioning in RDMA-Assisted Distributed Stream Processing Systems

To process large-scale real-time data streams, existing distributed stream processing systems (DSPSs) leverage different stream partitioning strategies. The one-to-many data partitioning strategy plays an important role in various applications. With one-to-many data partitioning, an upstream processing instance sends each generated tuple to a potentially large number of downstream processing instances. Existing DSPSs use an instance-oriented communication mechanism, in which an upstream instance transmits a tuple to each downstream instance separately. However, in one-to-many data partitioning, multiple downstream instances typically run on the same machine to exploit multi-core resources. A DSPS therefore sends the same data item to a machine multiple times, incurring significant unnecessary serialization and communication costs. We show that such a mechanism can lead to a serious performance bottleneck due to CPU overload. To address this problem, we design and implement Whale, an efficient RDMA (Remote Direct Memory Access) assisted distributed stream processing system. Two factors contribute to the efficiency of this design. First, we propose a novel RDMA-assisted stream multicast scheme with a self-adjusting non-blocking tree structure to alleviate the CPU workload of an upstream instance during one-to-many data partitioning. Second, we redesign the communication mechanism of existing DSPSs by replacing instance-oriented communication with a new worker-oriented communication scheme, which eliminates significant costs for redundant serialization and communication. We implement Whale on top of Apache Storm and conduct comprehensive experiments to evaluate its performance with large-scale real-world datasets. The results show that Whale achieves a 56.6× improvement in system throughput and a 97% reduction in processing latency compared to existing designs.
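
The difference between instance-oriented and worker-oriented communication can be illustrated with a minimal Python sketch. It is not Whale's implementation: the function names, the `placement` mapping, and the use of `pickle` as a stand-in serializer are all assumptions made for illustration. The sketch only counts how many serializations and network messages each scheme would need for a single tuple.

```python
import pickle
from collections import defaultdict


def instance_oriented_plan(tuple_data, placement):
    """One serialization and one message per downstream instance.

    placement maps a (hypothetical) downstream instance id to the worker
    that hosts it.
    """
    return [
        (worker, instance_id, pickle.dumps(tuple_data))
        for instance_id, worker in sorted(placement.items())
    ]


def worker_oriented_plan(tuple_data, placement):
    """One serialization in total and one message per worker.

    Each message carries the ids of all target instances on that worker,
    so the receiving worker can dispatch the tuple locally.
    """
    payload = pickle.dumps(tuple_data)  # serialized exactly once
    targets = defaultdict(list)
    for instance_id, worker in sorted(placement.items()):
        targets[worker].append(instance_id)
    return [(worker, ids, payload) for worker, ids in targets.items()]


# Hypothetical deployment: six downstream instances on three workers.
placement = {
    0: "worker-a", 1: "worker-a", 2: "worker-a",
    3: "worker-b", 4: "worker-b",
    5: "worker-c",
}
data = ("order-42", 3.5)

per_instance = instance_oriented_plan(data, placement)
per_worker = worker_oriented_plan(data, placement)
print(len(per_instance), "messages with instance-oriented communication")
print(len(per_worker), "messages with worker-oriented communication")
```

In this toy placement the upstream instance serializes and sends the tuple six times under the instance-oriented scheme, but serializes it once and sends three messages under the worker-oriented scheme. The saving grows with the number of downstream instances co-located on each machine.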
