Logical query optimization for Cloudera Impala system

Abstract: Cloudera Impala, an analytic database system for Apache Hadoop, has a severe limitation in query plan generation: it can only generate query plans in left-deep tree form, which restricts parallel execution. In this paper, we present a logical query optimization scheme for the Impala system. First, we propose an improved McCHyp (MinCutConservative Hypergraph) logical query plan generation algorithm for Impala, which reduces plan generation time by introducing a pruning strategy. Second, we propose a new cost model that takes the characteristics of the Impala system into account. Finally, we extend the Impala system to support query plans in bushy tree form by integrating the plan generation algorithm. We evaluated our scheme using the TPC-DS test suite. Experimental results show that the extended Impala system generally performs better than the original system, and that the improved plan generation algorithm runs faster than McCHyp. In addition, our cost model is a better fit for the extended Impala system, which supports query plans in bushy tree form.
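To make the left-deep versus bushy distinction concrete, the following is a minimal illustrative sketch, not the paper's algorithm or cost model. It builds both tree shapes over the same set of tables and compares their depth, used here as a rough, assumed proxy for how many joins must run one after another. The helper names (`Join`, `left_deep`, `bushy`, `depth`) are hypothetical, and the table names are borrowed from the TPC-DS schema purely as labels.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Join:
    """A binary join node; leaves are plain table names (str)."""
    left: "Plan"
    right: "Plan"


Plan = Union[str, Join]


def left_deep(tables: List[str]) -> Plan:
    """Chain the joins so every right child is a base table."""
    plan: Plan = tables[0]
    for table in tables[1:]:
        plan = Join(plan, table)
    return plan


def bushy(tables: List[str]) -> Plan:
    """Split the tables in half recursively, giving a balanced bushy tree."""
    if len(tables) == 1:
        return tables[0]
    mid = len(tables) // 2
    return Join(bushy(tables[:mid]), bushy(tables[mid:]))


def depth(plan: Plan) -> int:
    """Number of join levels on the longest root-to-leaf path."""
    if isinstance(plan, str):
        return 0
    return 1 + max(depth(plan.left), depth(plan.right))


# Illustrative labels only; no real query or join graph is implied.
tables = [
    "store_sales", "date_dim", "item", "customer",
    "store", "promotion", "customer_address", "household_demographics",
]

left_deep_depth = depth(left_deep(tables))
bushy_depth = depth(bushy(tables))
print("left-deep depth:", left_deep_depth)
print("bushy depth:", bushy_depth)
```

With eight tables, the left-deep chain has seven sequential join levels, while the balanced bushy tree has three, because independent subtrees could in principle be evaluated in parallel. A real optimizer must also respect the join graph and weigh costs, which this sketch deliberately ignores.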