Dynamically optimizing queries over large scale data platforms

Enterprises are adapting large-scale data processing platforms, such as Hadoop, to gain actionable insights from their "big data". Query optimization is still an open challenge in this environment due to the volume and heterogeneity of data, comprising both structured and un/semi-structured datasets. Moreover, it has become common practice to push business logic close to the data via user-defined functions (UDFs), which are usually opaque to the optimizer, further complicating cost-based optimization. As a result, classical relational query optimization techniques do not fit well in this setting, while at the same time, suboptimal query plans can be disastrous with large datasets. In this paper, we propose new techniques that take into account UDFs and correlations between relations for optimizing queries running on large scale clusters. We introduce "pilot runs", which execute part of the query over a sample of the data to estimate selectivities, and employ a cost-based optimizer that uses these selectivities to choose an initial query plan. Then, we follow a dynamic optimization approach, in which plans evolve as parts of the queries get executed. Our experimental results show that our techniques produce plans that are at least as good as, and up to 2x (4x) better for Jaql (Hive) than, the best hand-written left-deep query plans.

[1]  Christopher Olston,et al.  Building a HighLevel Dataflow System on top of MapReduce: The Pig Experience , 2009, Proc. VLDB Endow..

[2]  Goetz Graefe,et al.  Query evaluation techniques for large databases , 1993, CSUR.

[3]  Dominic Battré,et al.  Nephele/PACTs: a programming model and execution framework for web-scale analytical processing , 2010, SoCC '10.

[4]  Zhen He,et al.  Self-tuning cost modeling of user-defined functions in an object-relational DBMS , 2005, TODS.

[5]  Surajit Chaudhuri,et al.  Optimization of queries with user-defined predicates , 1996, TODS.

[6]  Beng Chin Ooi,et al.  Query optimization for massively parallel data processing , 2011, SoCC.

[7]  Scott Shenker,et al.  Shark: SQL and rich analytics at scale , 2012, SIGMOD '13.

[8]  Jignesh M. Patel,et al.  A comparison of join algorithms for log processing in MaPreduce , 2010, SIGMOD Conference.

[9]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[10]  Surajit Chaudhuri,et al.  Effective use of block-level sampling in statistics estimation , 2004, SIGMOD '04.

[11]  Patricia G. Selinger,et al.  Access path selection in a relational database management system , 1979, SIGMOD '79.

[12]  Nicolas Bruno,et al.  Continuous Cloud-Scale Query Optimization and Processing , 2013, Proc. VLDB Endow..

[13]  Volker Markl,et al.  Progressive optimization in a shared-nothing parallel database , 2007, SIGMOD '07.

[14]  David J. DeWitt,et al.  Parallel database systems: the future of high performance database systems , 1992, CACM.

[15]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[16]  Herodotos Herodotou,et al.  Stubby: A Transformation-based Optimizer for MapReduce Workflows , 2012, Proc. VLDB Endow..

[17]  Eugene Wong,et al.  Query processing in a system for distributed databases (SDD-1) , 1981, TODS.

[18]  Rajeev Motwani,et al.  Towards estimation error guarantees for distinct values , 2000, PODS.

[19]  Astrid Rheinländer,et al.  Opening the Black Boxes in Data Flow Optimization , 2012, Proc. VLDB Endow..

[20]  Per-Åke Larson,et al.  The Hekaton Memory-Optimized OLTP Engine , 2013, IEEE Data Eng. Bull..

[21]  Goetz Graefe The Cascades Framework for Query Optimization , 1995, IEEE Data Eng. Bull..

[22]  J. S. Saini,et al.  Adaptive Query Processing , 2006 .

[23]  Stavros Christodoulakis,et al.  On the propagation of errors in the size of join results , 1991, SIGMOD '91.

[24]  Andrey Balmin,et al.  Adaptive MapReduce using situation-aware mappers , 2012, EDBT '12.

[25]  Srikanth Kandula,et al.  Reoptimizing Data Parallel Computing , 2012, NSDI.

[26]  Andrey Balmin,et al.  Jaql , 2011, Proc. VLDB Endow..

[27]  Adam Silberstein,et al.  Benchmarking cloud serving systems with YCSB , 2010, SoCC '10.

[28]  Paul Brown,et al.  CORDS: automatic discovery of correlations and soft functional dependencies , 2004, SIGMOD '04.

[29]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[30]  Jianyong Dai,et al.  Apache Pig's Optimizer , 2013, IEEE Data Eng. Bull..

[31]  Hamid Pirahesh,et al.  Robust query processing through progressive optimization , 2004, SIGMOD '04.

[32]  Hamid Pirahesh,et al.  Extensible/rule based query rewrite optimization in Starburst , 1992, SIGMOD '92.

[33]  Chris Jermaine,et al.  Online aggregation for large MapReduce jobs , 2011, Proc. VLDB Endow..

[34]  Joseph M. Hellerstein,et al.  Optimization techniques for queries with expensive methods , 1998, TODS.

[35]  Peter J. Haas,et al.  On synopses for distinct-value estimation under multiset operations , 2007, SIGMOD '07.

[36]  David J. DeWitt,et al.  Efficient mid-query re-optimization of sub-optimal query execution plans , 1998, SIGMOD '98.

[37]  Li Chen,et al.  Regression-Based Self-Tuning Modeling of Smooth User-Defined Function Costs for an Object-Relational Database Management System Query Optimizer , 2004, Comput. J..

[38]  Michael Stonebraker,et al.  A comparison of approaches to large-scale data analysis , 2009, SIGMOD Conference.

[39]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[40]  David J. DeWitt,et al.  Proactive re-optimization , 2005, SIGMOD '05.