Stream processing with dependency-guided synchronization

Real-time data processing applications with low latency requirements have led to the increasing popularity of stream processing systems. While such systems offer convenient APIs that can be used to achieve data parallelism automatically, they offer limited support for computations that require synchronization between parallel nodes. In this paper, we propose dependency-guided synchronization (DGS), an alternative programming model for stateful streaming computations with complex synchronization requirements. In the proposed model, the input is viewed as partially ordered, and the program consists of a set of parallelization constructs that are applied to decompose the partial order and process events independently. Our programming model maps to an execution model called synchronization plans, which supports synchronization between parallel nodes. Our evaluation shows that the APIs offered by two widely used systems, Flink and Timely Dataflow, cannot suitably expose parallelism in some representative applications. In contrast, DGS enables implementations with scalable performance, the resulting synchronization plans offer throughput improvements when implemented manually in existing systems, and the programming overhead is small compared to writing sequential code.

CCS Concepts: • Software and its engineering → Parallel programming languages; Domain specific languages; • Information systems → Stream management.

PPoPP ’22, February 12–16, 2022, Seoul, Republic of Korea. © 2022 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-9204-4/22/02.
https://doi.org/10.1145/3503221.3508413
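
As a rough illustration of the idea in the abstract (an input viewed as partially ordered, decomposed so that independent events are processed separately and dependent events force synchronization), here is a minimal Python sketch. It is not the DGS API or the paper's implementation: the event encoding, the `depends` relation, and the functions `run_sequential` and `run_forked` are all hypothetical names chosen for this example, and the shards are simulated in a single thread rather than run on parallel nodes.

```python
from collections import Counter

# Hypothetical event encoding (an assumption for this sketch):
#   ("inc", key)         increments a per-key counter
#   ("read_reset", None) outputs all counters and resets them

def depends(a, b):
    """Assumed dependency relation: read_reset depends on everything,
    and two inc events depend on each other only if they share a key."""
    return a[0] == "read_reset" or b[0] == "read_reset" or a[1] == b[1]

def run_sequential(events):
    """Reference semantics: process events one by one in arrival order."""
    state, outputs = Counter(), []
    for tag, key in events:
        if tag == "inc":
            state[key] += 1
        else:
            outputs.append(dict(state))
            state.clear()
    return outputs

def run_forked(events, workers=2):
    """Route independent inc events to per-key shards; a read_reset acts as
    a synchronization point that joins all shard states before continuing."""
    shards = [Counter() for _ in range(workers)]
    outputs = []
    for tag, key in events:
        if tag == "inc":
            # Events with the same key always reach the same shard, so
            # dependent events are never split across shards.
            shards[hash(key) % workers][key] += 1
        else:
            merged = Counter()
            for shard in shards:
                merged.update(shard)  # join: add up the shard-local counts
                shard.clear()         # fork again from an empty state
            outputs.append(dict(merged))
    return outputs

events = [("inc", "a"), ("inc", "b"), ("inc", "a"), ("read_reset", None),
          ("inc", "b"), ("read_reset", None)]
print(run_forked(events) == run_sequential(events))
```

Because each shard only ever sees events that are independent of the events in other shards, the shards could in principle run on separate workers between synchronization points, while the joined result still matches the sequential run.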
