One SQL to Rule Them All - an Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables

Real-time data analysis and management are increasingly critical for today's businesses. SQL is the de facto lingua franca for these endeavors, yet support for robust streaming analysis and management with SQL remains limited. Many approaches restrict semantics to a reduced subset of features and/or require a suite of non-standard constructs. Additionally, use of event timestamps to provide native support for analyzing events according to when they actually occurred is not pervasive, and often comes with important limitations. We present a three-part proposal for integrating robust streaming into SQL, namely: (1) time-varying relations as a foundation for classical tables as well as streaming data, (2) event time semantics, (3) a limited set of optional keyword extensions to control the materialization of time-varying query results. We show how with these minimal additions it is possible to utilize the complete suite of standard SQL semantics to perform robust stream processing. We motivate and illustrate these concepts using examples and describe lessons learned from implementations in Apache Calcite, Apache Flink, and Apache Beam. We conclude with syntax and semantics of a concrete proposal for extensions of the SQL standard and note further areas of exploration.
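To make the first part of the proposal concrete, the following minimal Python sketch treats a time-varying relation as a changelog whose prefix up to time t yields the table at that time. It is an illustration under stated assumptions, not the paper's implementation: the names `Change`, `snapshot`, and `changelog`, and the integer timestamps, are hypothetical and chosen only for this example.

```python
from collections import Counter
from typing import NamedTuple


class Change(NamedTuple):
    """One entry of a hypothetical changelog (the stream view of a relation)."""
    at: int        # time at which the change takes effect
    insert: bool   # True = insert the row, False = retract it
    row: tuple


def snapshot(changelog, t):
    """Table view: the rows of the relation after applying all changes up to time t."""
    rows = Counter()
    for change in changelog:
        if change.at <= t:
            rows[change.row] += 1 if change.insert else -1
    # Counter.elements() skips rows whose count has dropped to zero or below.
    return sorted(rows.elements())


# Illustrative data: alice's value is updated from 10 to 12 at time 3,
# expressed as a retraction followed by an insertion.
changelog = [
    Change(1, True, ("alice", 10)),
    Change(2, True, ("bob", 5)),
    Change(3, False, ("alice", 10)),
    Change(3, True, ("alice", 12)),
]

print(snapshot(changelog, 2))  # [('alice', 10), ('bob', 5)]
print(snapshot(changelog, 3))  # [('alice', 12), ('bob', 5)]
```

In this sketch the same underlying object can be read either as a stream (the list of changes) or as a table (a snapshot at a chosen time), which is the sense in which a single relational abstraction could cover both.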
