Arc: an IR for batch and stream programming

In big data analytics, there are currently many data programming models and corresponding frontends, such as relational tables, graphs, tensors, and streams. This has led to a plethora of runtimes, each typically focused on the efficient execution of a single frontend. This fragmentation manifests itself today in highly complex pipelines that bundle multiple runtimes to support the necessary models. As a result, joint optimization and execution of such pipelines across these frontend-bound runtimes is infeasible. We propose Arc as the first unified Intermediate Representation (IR) for data analytics that incorporates stream semantics, based on a modern specification of streams, windows, and stream aggregation, to combine batch and stream computation models. Arc extends Weld, an IR for batch computation, and adds support for partitioned, out-of-order stream and window operators, which are the most fundamental building blocks in contemporary data streaming.
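To make "partitioned, out-of-order stream and window operators" concrete, the following is a minimal Python sketch of a per-key tumbling-window sum that tolerates out-of-order arrival and closes windows when a watermark passes their end. It is not Arc or Weld syntax; the class name `TumblingWindowSum`, its methods, and the use of watermarks as the progress signal are assumptions chosen purely for illustration.

```python
from collections import defaultdict


class TumblingWindowSum:
    """Hypothetical per-key tumbling-window sum over an out-of-order stream."""

    def __init__(self, size):
        self.size = size
        # (key, window_start) -> running sum for that partition and window
        self.state = defaultdict(int)

    def on_event(self, key, timestamp, value):
        # Assign the event to its window by event time, not arrival order.
        start = timestamp - timestamp % self.size
        self.state[(key, start)] += value

    def on_watermark(self, watermark):
        # Emit and discard every window whose end is at or before the watermark.
        closed = sorted(k for k in self.state if k[1] + self.size <= watermark)
        return [(key, start, self.state.pop((key, start))) for key, start in closed]


op = TumblingWindowSum(size=10)
# Events arrive out of timestamp order and interleaved across keys.
for key, ts, value in [("a", 1, 5), ("b", 12, 1), ("a", 7, 2), ("a", 11, 4), ("b", 3, 9)]:
    op.on_event(key, ts, value)

print(op.on_watermark(10))  # windows covering [0, 10)
print(op.on_watermark(20))  # windows covering [10, 20)
```

The sketch keeps state per (key, window) pair, so partitions aggregate independently. It does not handle events that arrive after their window has already been emitted; such a late event would simply open a fresh window entry.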
