Cache-Based Multi-Query Optimization for Data-Intensive Scalable Computing Frameworks

In modern large-scale distributed systems, analytics jobs submitted by different users often share similar work, for example scanning and processing the same subset of data. Optimizing each job independently can result in redundant and wasteful processing, whereas multi-query optimization techniques can save a considerable amount of cluster resources. In this work, we introduce a novel method that combines in-memory cache primitives with multi-query optimization to improve the efficiency of data-intensive, scalable computing frameworks. By carefully selecting and exploiting common (sub)expressions while satisfying memory constraints, our method transforms a batch of queries into a new, more efficient batch that avoids unnecessary recomputation. To find feasible and efficient execution plans, our method uses a cost-based optimization formulation akin to the multiple-choice knapsack problem. Extensive experiments on a prototype implementation of our system show significant benefits of worksharing for both TPC-DS workloads and detailed micro-benchmarks.
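
To make the knapsack analogy concrete, the following is a minimal sketch of a multiple-choice knapsack selection, not the formulation used in the paper. It assumes that each group holds alternative ways of caching one shared (sub)expression, that at most one alternative per group may be cached, and that memory costs are integers in some hypothetical unit. The names `select_cache_plan`, `groups`, and `budget` are illustrative only.

```python
from typing import List, Optional, Tuple

# A candidate is (memory_cost, estimated_saving). Both values are
# hypothetical: e.g. memory in GB and saving in seconds of recomputation.
Candidate = Tuple[int, float]


def select_cache_plan(
    groups: List[List[Candidate]], budget: int
) -> Tuple[float, List[Optional[int]]]:
    """Choose at most one candidate per group to maximize total saving
    without exceeding the integer memory budget.

    Returns (best_saving, choices), where choices[i] is the index of the
    candidate picked in group i, or None if nothing is cached for it.
    """
    # best[b] = (saving, choices) for the groups seen so far with memory <= b
    best: List[Tuple[float, List[Optional[int]]]] = [
        (0.0, []) for _ in range(budget + 1)
    ]
    for group in groups:
        nxt = []
        for b in range(budget + 1):
            # Baseline: cache nothing from this group.
            top_saving = best[b][0]
            top_choice = best[b][1] + [None]
            for idx, (mem, saving) in enumerate(group):
                if mem <= b:
                    total = best[b - mem][0] + saving
                    if total > top_saving:
                        top_saving = total
                        top_choice = best[b - mem][1] + [idx]
            nxt.append((top_saving, top_choice))
        best = nxt
    return best[budget]


if __name__ == "__main__":
    # Three shared subexpressions, each with alternative caching options.
    groups = [
        [(4, 10.0), (2, 6.0)],
        [(3, 7.0)],
        [(5, 12.0), (1, 2.0)],
    ]
    print(select_cache_plan(groups, budget=6))  # (15.0, [1, 0, 1])
```

In this toy instance the cheapest alternative of each group fits the budget together and beats caching any single large candidate. The dynamic program runs in time proportional to the budget times the total number of candidates, so it only suits coarse-grained memory units; a real optimizer would also need cost estimates for each candidate, which this sketch takes as given.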
