Compilation Techniques for Incremental Collection Processing

Many map-reduce frameworks and NoSQL systems rely on collection programming as their interface of choice because it offers rich semantics together with an easily parallelizable set of primitives. Unfortunately, current systems do not fulfill the potential of collection programming, as they lack efficient incremental view maintenance (IVM) techniques for queries producing large nested results. The reason is that the nesting of collections does not enjoy the same algebraic properties that underpin the optimization potential of typical collection processing constructs. We propose the first solution for the efficient incrementalization of collection programming in terms of its core constructs, as captured by the positive nested relational calculus (NRC+) on bags (with integer multiplicities). We take an approach based on delta query derivation, whose goal is to generate delta queries which, given a small change in the input, can update the materialized view more efficiently than recomputation. More precisely, we model the cost of NRC+ operators and classify a query as efficiently incrementalizable if its delta has a strictly lower cost than full re-evaluation. We then identify IncNRC+, a large fragment of NRC+ that is efficiently incrementalizable, and we provide a semantics-preserving translation that takes any NRC+ query to a collection of IncNRC+ queries. Furthermore, we prove that incremental maintenance for NRC+ is within the complexity class NC0, and we show how Recursive IVM, a technique that has provided significant speedups over traditional IVM in the case of flat queries, can also be applied to IncNRC+.

Existing systems are also limited with respect to the size of inner collections that they can effectively handle before running into severe performance bottlenecks. In particular, when nested collections have skewed cardinalities, developers typically have to go through a painful process of manual query rewrites to ensure that the largest inner collections in their workloads are not affected by these limitations. To address these issues we developed SLeNDer, a compilation framework that, given a nested query, generates a set of semantically equivalent (partially) shredded queries that can be efficiently evaluated and incrementalized using state-of-the-art techniques for handling skew and applying delta changes, respectively. The derived queries expose nested collections to the same opportunities for distributing their processing and incrementally updating their contents as those enjoyed by top-level collections, yielding speedups of up to 16.8x for offline processing and up to 21.9x for online processing on our benchmark.

To enable efficient IVM for the increasingly common case of collection programming with functional values, as in Links, we also discuss the efficient incrementalization of simply typed lambda calculi, under the constraint that their primitives are themselves efficiently incrementalizable.
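The idea of delta query derivation on bags with integer multiplicities can be illustrated with a small Python sketch. It is not the implementation described above: the representation of a bag as a dictionary from elements to (possibly negative) multiplicities and the helper names `union`, `flat_map`, `cross`, and `delta_cross` are assumptions made for illustration only. The sketch uses the standard bilinearity of the bag product, so the delta of `R x S` under changes `dR` and `dS` is `dR x S + R x dS + dR x dS`, which touches only the changed tuples on at least one side.

```python
# Hypothetical helpers for illustration; a bag is a dict: element -> integer multiplicity.
# Negative multiplicities encode deletions.

def union(a, b):
    """Bag union: add multiplicities and drop elements whose count becomes 0."""
    out = dict(a)
    for k, m in b.items():
        out[k] = out.get(k, 0) + m
    return {k: m for k, m in out.items() if m != 0}

def flat_map(f, bag):
    """'for x in bag union f(x)': multiplicities multiply through."""
    out = {}
    for x, m in bag.items():
        for y, n in f(x).items():
            out[y] = out.get(y, 0) + m * n
    return {k: v for k, v in out.items() if v != 0}

def cross(r, s):
    """Bag product expressed with nested flat_map calls."""
    return flat_map(lambda x: flat_map(lambda y: {(x, y): 1}, s), r)

def delta_cross(r, s, dr, ds):
    """Delta of cross(r, s) for updates dr, ds, using bilinearity of the product."""
    return union(union(cross(dr, s), cross(r, ds)), cross(dr, ds))

# Materialized view over the old inputs.
r = {"a": 2, "b": 1}
s = {1: 1, 2: 3}
view = cross(r, s)

# Small updates: one copy of "a" deleted, "c" inserted; all copies of 2 deleted, 3 inserted.
dr = {"a": -1, "c": 1}
ds = {2: -3, 3: 1}

# Incremental maintenance: apply the delta to the old view.
maintained = union(view, delta_cross(r, s, dr, ds))

# Reference result: full recomputation over the updated inputs.
recomputed = cross(union(r, dr), union(s, ds))
```

In this sketch `maintained` and `recomputed` denote the same bag. Representing deletions as negative multiplicities is what lets insertions and deletions be handled by the same union operation.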
