Data-trace types for distributed stream processing systems

Distributed architectures for efficient processing of streaming data are increasingly critical to modern information processing systems. The goal of this paper is to develop type-based programming abstractions that facilitate correct and efficient deployment of a logical specification of the desired computation on such architectures. In the proposed model, each communication link has an associated type specifying tagged data items along with a dependency relation over tags that captures the logical partial ordering constraints over data items. The semantics of a (distributed) stream processing system is then a function from input data traces to output data traces, where a data trace is an equivalence class of sequences of data items induced by the dependency relation. This data-trace transduction model generalizes both acyclic synchronous data-flow and relational query processors, and can specify computations over data streams with a rich variety of partial ordering and synchronization characteristics. We then describe a set of programming templates for data-trace transductions: abstractions corresponding to common stream processing tasks. Our system automatically maps these high-level programs to a given topology on the distributed implementation platform Apache Storm while preserving the semantics. Our experimental evaluation shows that (1) while automatic parallelization deployed by existing systems may not preserve semantics, particularly when the computation is sensitive to the ordering of data items, our programming abstractions allow a natural specification of the query that contains a mix of ordering constraints while guaranteeing correct deployment, and (2) the throughput of the automatically compiled distributed code is comparable to that of hand-crafted distributed implementations.
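To make the notion of a data trace concrete, the following is a minimal sketch of how two sequences of tagged items could be tested for membership in the same equivalence class. It is not the paper's implementation. It assumes items are `(tag, value)` pairs and uses a layered, Foata-style normal form from trace theory as the canonical representative. All names (`make_dependence`, `normal_form`, `equivalent`) and the example tags `M` (measurement) and `#` (marker) are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, List, Sequence, Tuple

Item = Tuple[str, int]            # (tag, value); an assumed encoding of a data item
Dependence = FrozenSet[Tuple[str, str]]


def make_dependence(pairs) -> Dependence:
    """Return the symmetric closure of a set of tag pairs."""
    return frozenset(pairs) | frozenset((b, a) for a, b in pairs)


def dependent(x: Item, y: Item, dep: Dependence) -> bool:
    # Identical items are treated as dependent: swapping them changes nothing,
    # so this does not alter the equivalence classes, but it keeps the
    # normal form canonical.
    return x == y or (x[0], y[0]) in dep


def normal_form(seq: Sequence[Item], dep: Dependence):
    """Group items into layers; an item's layer is the length of the longest
    chain of earlier, dependent items that must precede it."""
    levels: List[int] = []
    layers: List[List[Item]] = []
    for i, x in enumerate(seq):
        level = 0
        for j in range(i):
            if dependent(seq[j], x, dep):
                level = max(level, levels[j] + 1)
        levels.append(level)
        if level == len(layers):
            layers.append([])
        layers[level].append(x)
    # Items within one layer are pairwise independent, so sorting is safe.
    return tuple(tuple(sorted(layer)) for layer in layers)


def equivalent(u: Sequence[Item], v: Sequence[Item], dep: Dependence) -> bool:
    """Two sequences denote the same data trace iff their normal forms agree."""
    return normal_form(u, dep) == normal_form(v, dep)


# Hypothetical type: measurements are unordered among themselves,
# but ordered relative to markers; markers are ordered among themselves.
DEP = make_dependence({("M", "#"), ("#", "#")})

a = [("M", 5), ("M", 7), ("#", 0), ("M", 9)]
b = [("M", 7), ("M", 5), ("#", 0), ("M", 9)]   # two measurements swapped
c = [("M", 5), ("#", 0), ("M", 7), ("M", 9)]   # a measurement moved past a marker
```

Under these assumptions, `equivalent(a, b, DEP)` holds because the two measurements before the marker may be reordered freely, while `equivalent(a, c, DEP)` does not, since moving a measurement across a marker violates the dependency between `M` and `#`. The quadratic scan over earlier items keeps the sketch short; it is not meant as an efficient algorithm.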
