Beyond Analytics: The Evolution of Stream Processing Systems

Stream processing has been an active research field for more than 20 years, but it is only now reaching its prime, thanks to recent successful efforts by the research community and by numerous open-source communities worldwide. The goal of this tutorial is threefold. First, we review and highlight noteworthy past research findings that were largely ignored until very recently. Second, we underline the differences between early ('00-'10) and modern ('11-'18) streaming systems and show how those systems have evolved over the years. Third, and most importantly, we wish to draw the attention of the database community to a recent trend: streaming systems are no longer used only for classic stream processing workloads, namely window aggregates and joins. Instead, modern streaming systems are increasingly used to deploy general event-driven applications in a scalable fashion, which challenges the design decisions, architecture, and intended use of existing stream processing systems.
