A Survey of Distributed Data Stream Processing Frameworks

Big data processing systems are becoming more stream-oriented: distributed, low-latency computational frameworks process each data record continuously as it arrives. As stream processing technology matures and more organizations invest in digital transformation, new applications of stream analytics will be identified and implemented across a wide spectrum of industries. One challenge in developing a streaming analytics infrastructure is selecting the right stream processing framework for each use case. To address this issue, this paper presents a taxonomy, a comparative study of distributed data stream processing and analytics frameworks, and a critical review of representative open-source (Storm, Spark Streaming, Flink, Kafka Streams) and commercial (IBM Streams) distributed data stream processing frameworks. The paper also reports on our ongoing work on a multilevel streaming analytics architecture that can serve as a guide for organizations and individuals planning to implement a real-time data stream processing and analytics framework.
