How to Win a Hot Dog Eating Contest: Distributed Incremental View Maintenance with Batch Updates

In the quest for valuable information, modern big data applications continuously monitor streams of data. These applications demand low-latency stream processing even when faced with a high volume and velocity of incoming changes and with users' desire to ask complex queries. In this paper, we study low-latency incremental computation of complex SQL queries in both local and distributed streaming environments. We develop a technique for efficiently incrementalizing queries with nested aggregates under batch updates. We identify the cases in which batch processing can boost the performance of incremental view maintenance, but we also demonstrate that tuple-at-a-time processing can often achieve better performance in local mode. Batch updates are essential for enabling distributed incremental view maintenance and for amortizing the cost of network communication and synchronization. We show how to derive incremental programs optimized for running on large-scale processing platforms. Our implementation of distributed incremental view maintenance can process tens of millions of tuples with few-second latency using hundreds of nodes.
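
To make the contrast between tuple-at-a-time and batch maintenance concrete, the following is a minimal sketch and not the paper's implementation. It assumes a single hypothetical view, `SELECT k, SUM(v) FROM R GROUP BY k`, with updates given as `(key, value, multiplicity)` triples, where a multiplicity of -1 denotes a deletion. The function names are illustrative only.

```python
from collections import defaultdict

def apply_single(view, key, value, multiplicity=1):
    """Tuple-at-a-time maintenance: one view update per incoming tuple."""
    view[key] += multiplicity * value

def apply_batch(view, batch):
    """Batch maintenance: pre-aggregate the batch into a per-key delta,
    then merge the delta into the view with one update per distinct key."""
    delta = defaultdict(int)
    for key, value, multiplicity in batch:
        delta[key] += multiplicity * value
    for key, change in delta.items():
        view[key] += change
    return len(delta)  # number of view entries touched

# Hypothetical update stream: three insertions and one deletion.
updates = [("a", 5, 1), ("b", 2, 1), ("a", 3, 1), ("a", 5, -1)]

view_single = defaultdict(int)
for k, v, m in updates:
    apply_single(view_single, k, v, m)

view_batch = defaultdict(int)
touched = apply_batch(view_batch, updates)
```

Both strategies yield the same view contents. In this sketch the batch variant touches the view once per distinct key rather than once per tuple, which is the kind of saving that matters when each view update involves network communication. For a local, in-memory view, the extra pre-aggregation pass may not pay off.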
