A Survey on the Evolution of Stream Processing Systems

Stream processing has been an active research field for more than 20 years, but it is now in its prime thanks to recent successful efforts by the research community and numerous open-source communities worldwide. This survey provides a comprehensive overview of the fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between early ('00-'10) and modern ('11-'18) streaming systems, and discuss recent trends and open problems.
