SamzaSQL: Scalable Fast Data Management with Streaming SQL

As the data-driven economy evolves, enterprises have come to realize a competitive advantage in being able to act on high volume, high velocity streams of data. Technologies such as distributed message queues and streaming processing platforms that can scale to thousands of data stream partitions on commodity hardware are a response. However, the programming API provided by these systems is often low-level, requiring substantial custom code that adds to the programmer learning curve and maintenance overhead. Additionally, these systems often lack SQL querying capabilities that have proven popular on Big Data systems like Hive, Impala or Presto. We define a minimal set of extensions to standard SQL for data stream querying and manipulation. These extensions are prototyped in SamzaSQL, a new tool for streaming SQL that compiles streaming SQL into physical plans that are executed on Samza, an open-source distributed stream processing framework. We compare the performance of streaming SQL queries against native Samza applications and discuss usability improvements. SamzaSQL is a part of the open source Apache Samza project and will be available for general use.

[1]  Michael Kaminsky,et al.  Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles , 2013, SOSP 2013.

[2]  Jay Kreps,et al.  Kafka : a Distributed Messaging System for Log Processing , 2011 .

[3]  Leonardo Neumeyer,et al.  S4: Distributed Stream Computing Platform , 2010, 2010 IEEE International Conference on Data Mining Workshops.

[4]  Beth Plale Leveraging run time knowledge about event rates to improve memory utilization in wide area data stream filtering , 2002, Proceedings 11th IEEE International Symposium on High Performance Distributed Computing.

[5]  Scott Shenker,et al.  Discretized streams: fault-tolerant streaming computation at scale , 2013, SOSP.

[6]  Ying Xing,et al.  The Design of the Borealis Stream Processing Engine , 2005, CIDR.

[7]  Karsten Schwan,et al.  Software approach to hazard detection using on-line analysis of safety constraints , 1997, Proceedings of SRDS'97: 16th IEEE Symposium on Reliable Distributed Systems.

[8]  Frederick Reiss,et al.  TelegraphCQ: continuous dataflow processing , 2003, SIGMOD '03.

[9]  Qiang Chen,et al.  Aurora : a new model and architecture for data stream management ) , 2006 .

[10]  Douglas B. Terry,et al.  Continuous queries over append-only databases , 1992, SIGMOD '92.

[11]  Theodore Johnson,et al.  Gigascope: a stream database for network applications , 2003, SIGMOD '03.

[12]  Karsten Schwan,et al.  Dynamic Querying of Streaming Data with the dQUOB System , 2003, IEEE Trans. Parallel Distributed Syst..

[13]  Lei Gao,et al.  All aboard the Databus!: Linkedin's scalable consistent change data capture platform , 2012, SoCC '12.

[14]  Badrish Chandramouli,et al.  Trill: A High-Performance Incremental Query Processor for Diverse Analytics , 2014, Proc. VLDB Endow..

[15]  Himanshu Gupta,et al.  The RADStack: Open Source Lambda Architecture for Interactive Analytics , 2017, HICSS.

[16]  Jun Rao,et al.  Liquid: Unifying Nearline and Offline Big Data Integration , 2015, CIDR.

[17]  Jimmy J. Lin,et al.  Summingbird: A Framework for Integrating Batch and Online MapReduce Computations , 2014, Proc. VLDB Endow..

[18]  Carlo Curino,et al.  Apache Hadoop YARN: yet another resource negotiator , 2013, SoCC.

[19]  Jennifer Widom,et al.  STREAM: The Stanford Data Stream Management System , 2016, Data Stream Management.

[20]  Jennifer Widom,et al.  The CQL continuous query language: semantic foundations and query execution , 2006, The VLDB Journal.