Revisiting Reuse in Main Memory Database Systems

Reusing intermediate results to speed up analytical query processing has been studied in prior work. Existing solutions require the intermediate results of individual operators to be materialized by dedicated materialization operators. However, inserting such operators into a query plan not only incurs additional execution costs but often also eliminates important cache- and register-locality opportunities, resulting in even higher performance penalties. This paper studies a novel reuse model for intermediates, which caches the internal physical data structures that are materialized during query processing (due to pipeline breakers) and externalizes them so that they become reusable for upcoming operations. We focus on hash tables, the most commonly used internal data structure in main memory databases for join and aggregation operations. As queries arrive, our reuse-aware optimizer reasons about the reuse opportunities for hash tables, employing cost models that take into account hash table statistics together with the CPU and data movement costs within the cache hierarchy. Experimental results, based on our prototype implementation, demonstrate performance gains of 2x for typical analytical workloads with no additional overhead for materializing intermediates.
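To make the reuse model concrete, the following is a minimal sketch, not the prototype described in this paper. It assumes the simplest possible case: a join hash table is reused only when a later query has exactly the same build side, and the decision is made by a lookup rather than by a cost model. The names `HashTableCache`, `get_or_build`, `hash_join`, and the string signature are hypothetical and introduced for illustration only.

```python
from collections import defaultdict


class HashTableCache:
    """Toy cache of join hash tables, keyed by a build-side signature.

    Assumption: reuse happens only on an exact signature match. A real
    optimizer would also weigh hash table statistics and cache costs.
    """

    def __init__(self):
        self._tables = {}
        self.builds = 0  # hash tables actually built
        self.reuses = 0  # hash tables served from the cache

    def get_or_build(self, signature, rows, key):
        table = self._tables.get(signature)
        if table is not None:
            self.reuses += 1
            return table
        # The build phase is a pipeline breaker: the table is materialized
        # regardless, so keeping it adds no extra materialization step.
        table = defaultdict(list)
        for row in rows:
            table[row[key]].append(row)
        self._tables[signature] = table
        self.builds += 1
        return table


def hash_join(cache, signature, build_rows, build_key, probe_rows, probe_key):
    """Equi-join that probes a cached hash table when one is available."""
    table = cache.get_or_build(signature, build_rows, build_key)
    return [
        {**build_row, **probe_row}
        for probe_row in probe_rows
        for build_row in table.get(probe_row[probe_key], [])
    ]


cache = HashTableCache()
customers = [{"c_id": 1, "name": "a"}, {"c_id": 2, "name": "b"}]
orders_q1 = [{"o_id": 10, "c_id": 1}, {"o_id": 11, "c_id": 1}, {"o_id": 12, "c_id": 3}]
orders_q2 = [{"o_id": 20, "c_id": 2}]

# Both queries build on the same input, so the second one skips the build.
result_q1 = hash_join(cache, "customers/c_id", customers, "c_id", orders_q1, "c_id")
result_q2 = hash_join(cache, "customers/c_id", customers, "c_id", orders_q2, "c_id")
```

In this sketch the second query pays only for probing. Deciding whether a cached table whose build side merely overlaps with the new query is still worth reusing is the part that would need the cost models mentioned above, and it is deliberately left out here.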
