Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka

In recent years there has been a surge in applications that process streaming data to generate insights in real time. Both academia and industry have addressed this use case by developing a variety of Stream Processing Engines (SPEs) with diverse feature sets. At the same time, Big Data applications have started to make use of High-Performance Computing (HPC) clusters, which possess superior memory, I/O, and networking resources compared to typical Big Data clusters. Recent studies evaluating the performance of SPEs have focused on commodity clusters. However, exhaustive studies are still needed to profile the individual stages of a stream processing pipeline and to determine how each stage can be optimized to leverage the resources provided by HPC clusters. To address this issue, we profile the performance of a big data streaming pipeline using Apache Flink as the SPE and Apache Kafka as the intermediate message queue. We break the streaming pipeline into two distinct phases and evaluate percentile latencies for two different networks, namely 40GbE and InfiniBand EDR (100Gbps), to determine whether a typical streaming application is network intensive enough to benefit from a faster interconnect. Moreover, we explore whether the volume of the input data stream has any effect on the latency characteristics of the streaming pipeline and, if so, how that effect compares across the stages of the pipeline and across network interconnects. Our experiments show an increase of over 10x in 98th-percentile latency when the input stream volume is increased from 128MB/s to 256MB/s. We also find the intermediate stages of the stream pipeline to be a significant contributor to the overall latency of the system.
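As a rough illustration of the per-phase percentile latency measurement described above, the following minimal Python sketch may help. It rests on assumptions not stated in the abstract: each event is taken to carry three timestamps (the hypothetical fields `produced_ms`, `ingested_ms`, and `emitted_ms`), the two phases are taken to be the intervals between them, and the nearest-rank definition of a percentile is used. The study itself may define the phases and the percentile differently.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to limit floating-point rounding error.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def stage_latencies(events):
    """Split end-to-end latency into two phases.

    events: iterable of (produced_ms, ingested_ms, emitted_ms) tuples.
    This schema is hypothetical and only serves the illustration.
    """
    events = list(events)
    phase1 = [ingested - produced for produced, ingested, _ in events]
    phase2 = [emitted - ingested for _, ingested, emitted in events]
    return phase1, phase2
```

With latencies split this way, `percentile(phase1, 98)` and `percentile(phase2, 98)` can be compared across input rates or interconnects to see which phase dominates the tail.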
