Quantitative Impact Evaluation of an Abstraction Layer for Data Stream Processing Systems

With the demand to process ever-growing data volumes, a variety of new data stream processing frameworks have been developed. Moving an implementation from one such system to another, e.g., for performance reasons, requires adapting existing applications to new interfaces. Apache Beam addresses these high substitution costs by providing an abstraction layer that enables executing programs on any of the supported streaming frameworks. In this paper, we present a novel benchmark architecture for comparing the performance impact of using Apache Beam on three streaming frameworks: Apache Spark Streaming, Apache Flink, and Apache Apex. We find significant performance penalties when using Apache Beam for application development in the surveyed systems. Overall, usage of Apache Beam for the examined streaming applications caused a high variance of query execution times with a slowdown of up to a factor of 58 compared to queries developed without the abstraction layer. All developed benchmark artifacts are publicly available to ensure reproducible results.

[1]  Guenter Hesse,et al.  Conceptual Survey on Data Stream Processing Systems , 2015, 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS).

[2]  Otto Carlos Muniz Bandeira Duarte,et al.  A Performance Comparison of Open-Source Stream Processing Platforms , 2016, 2016 IEEE Global Communications Conference (GLOBECOM).

[3]  Seif Haridi,et al.  Apache Flink™: Stream and Batch Processing in a Single Engine , 2015, IEEE Data Eng. Bull..

[4]  Michael Stonebraker,et al.  "One Size Fits All": An Idea Whose Time Has Come and Gone (Abstract) , 2005, ICDE.

[5]  Qiang Chen,et al.  Aurora : a new model and architecture for data stream management ) , 2006 .

[6]  Michael Stonebraker,et al.  Linear Road: A Stream Data Management Benchmark , 2004, VLDB.

[7]  Craig Chambers,et al.  The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing , 2015, Proc. VLDB Endow..

[8]  Gang Wu,et al.  Stream Bench: Towards Benchmarking Modern Distributed Stream Computing Frameworks , 2014, 2014 IEEE/ACM 7th International Conference on Utility and Cloud Computing.

[9]  Carlo Curino,et al.  Apache Hadoop YARN: yet another resource negotiator , 2013, SoCC.

[10]  Indranil Gupta,et al.  Stateful Scalable Stream Processing at LinkedIn , 2017, Proc. VLDB Endow..

[11]  Daniel Lemire,et al.  Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources , 2018, SIGMOD Conference.

[12]  Milind Bhandarkar,et al.  AdBench: A Complete Benchmark for Modern Data Pipelines , 2016, TPCTC.

[13]  Peter A. Tucker,et al.  NEXMark – A Benchmark for Queries over Data Streams DRAFT , 2002 .

[14]  Jennifer Widom,et al.  Towards a streaming SQL standard , 2008, Proc. VLDB Endow..

[15]  María S. Pérez-Hernández,et al.  Spark Versus Flink: Understanding Performance in Big Data Analytics Frameworks , 2016, 2016 IEEE International Conference on Cluster Computing (CLUSTER).

[16]  Carlo Curino,et al.  Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications , 2015, SIGMOD Conference.

[17]  Reynold Xin,et al.  Apache Spark , 2016 .

[18]  Eric A. Brewer,et al.  Kubernetes and the path to cloud native , 2015, SoCC.

[19]  Yi Pan,et al.  SamzaSQL: Scalable Fast Data Management with Streaming SQL , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[20]  Hasso Plattner,et al.  Object-Relational Mapping Revisited - A Quantitative Study on the Impact of Database Technology on O/R Mapping Strategies , 2017, HICSS.

[21]  Scott Shenker,et al.  Discretized streams: fault-tolerant streaming computation at scale , 2013, SOSP.

[22]  Kun-Lung Wu,et al.  Challenges and Experiences in Building an Efficient Apache Beam Runner For IBM Streams , 2018, Proc. VLDB Endow..

[23]  Dhabaleswar K. Panda,et al.  Accelerating Spark with RDMA for Big Data Processing: Early Experiences , 2014, 2014 IEEE 22nd Annual Symposium on High-Performance Interconnects.

[24]  Jennifer Widom,et al.  STREAM: The Stanford Stream Data Manager , 2003, IEEE Data Eng. Bull..

[25]  Jay Kreps,et al.  Kafka : a Distributed Messaging System for Log Processing , 2011 .

[26]  Randy H. Katz,et al.  Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center , 2011, NSDI.

[27]  Jennifer Widom,et al.  The CQL continuous query language: semantic foundations and query execution , 2006, The VLDB Journal.