Synchronization Schemas

We present a type-theoretic framework for data stream processing in support of real-time decision making, where the desired computation mixes sequential computation, such as smoothing and the detection of peaks and surges, with naturally parallel computation, such as relational operations, key-based partitioning, and map-reduce. Our framework unifies sequential (ordered) and relational (unordered) data models. In particular, we define synchronization schemas as types and series-parallel streams (SPS) as the objects of these types. A synchronization schema imposes a hierarchical structure over relational types that succinctly captures ordering and synchronization requirements among different kinds of data items. Series-parallel streams naturally model objects such as relations, sequences, sequences of relations, sets of streams indexed by key values, time-based and event-based windows, and more complex structures obtained by nesting these. We introduce series-parallel stream transformers (SPST) as a domain-specific language for the modular specification of deterministic transformations over such streams. SPSTs provably specify only monotonic transformations, which allows streamability; have a modular structure that can be exploited for correct parallel implementation; and are composable, so that complex queries can be specified as pipelines of transformations.
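To make the kinds of objects listed above concrete, the following is a minimal Python sketch and not the formal definitions or the SPST language of the paper. The encoding is an assumption made for illustration: unordered (relational) data is a `Counter`, ordered data is a `list`, and a set of streams indexed by key is a `dict`. The function names `partition_by_key`, `tumbling_windows`, and `running_sums` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import accumulate

# Hypothetical encoding for illustration only; not the paper's definitions.
# - unordered (relational) data -> Counter (a multiset)
# - ordered (sequential) data   -> list
# - streams indexed by key      -> dict from key to list


def partition_by_key(events):
    """Split (key, value) events into one ordered sub-stream per key.

    Order is kept within a key and deliberately forgotten across keys.
    """
    streams = defaultdict(list)
    for key, value in events:
        streams[key].append(value)
    return dict(streams)


def tumbling_windows(values, size):
    """Group values into consecutive windows of `size` items.

    The result is a sequence of relations: the windows are ordered,
    but the items inside one window form an unordered multiset.
    """
    return [Counter(values[i:i + size]) for i in range(0, len(values), size)]


def running_sums(streams):
    """Per-key running sum: a sequential computation applied to each key."""
    return {key: list(accumulate(values)) for key, values in streams.items()}


a = [("x", 1), ("y", 10), ("x", 2)]
b = [("y", 10), ("x", 1), ("x", 2)]  # same per-key order, other interleaving
c = [("x", 2), ("y", 10), ("x", 1)]  # per-key order of "x" changed

print(partition_by_key(a) == partition_by_key(b))
print(partition_by_key(a) == partition_by_key(c))
print(tumbling_windows([3, 1, 2, 5], 2) == tumbling_windows([1, 3, 5, 2], 2))

# Monotonicity in this toy setting: the output computed on an input prefix
# is, key by key, a prefix of the output computed on the full input.
prefix_out = running_sums(partition_by_key(a[:2]))
full_out = running_sums(partition_by_key(a))
print(all(full_out[k][:len(v)] == v for k, v in prefix_out.items()))
```

In this sketch, `a` and `b` denote the same keyed stream because they differ only in how items with different keys are interleaved, while `c` is a different stream because it reorders items that share a key. The last check illustrates, for this one transformation only, the sense in which a monotonic transformation can emit output incrementally as input arrives.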
