Quantitative Impact Evaluation of an Abstraction Layer for Data Stream Processing Systems

With the demand to process ever-growing data volumes, a variety of new data stream processing frameworks have been developed. Moving an implementation from one such system to another, e.g., for performance reasons, requires adapting existing applications to new interfaces. Apache Beam addresses these high substitution costs by providing an abstraction layer that enables executing programs on any of the supported streaming frameworks. In this paper, we present a novel benchmark architecture for comparing the performance impact of using Apache Beam on three streaming frameworks: Apache Spark Streaming, Apache Flink, and Apache Apex. We find significant performance penalties when using Apache Beam for application development in the surveyed systems. Overall, usage of Apache Beam for the examined streaming applications caused a high variance of query execution times with a slowdown of up to a factor of 58 compared to queries developed without the abstraction layer. All developed benchmark artifacts are publicly available to ensure reproducible results.
