Optimization of Multiple Queries for Big Data with Apache Hadoop/Hive

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It sits on top of Hadoop to summarize Big Data and makes querying and analysis easy. The Hadoop MapReduce framework speeds up the execution of queries. This manuscript proposes the use of a Multi Query Optimization (MQO) technique to enhance the overall performance of Hadoop/Hive. During simultaneous execution of multiple queries, many opportunities can arise for sharing search and/or computation tasks. Executing common jobs only once can reduce the total execution time of all queries remarkably. Our framework transforms a set of interrelated HiveQL queries into new global queries that produce the same results in remarkably smaller total execution times. Experiments show that the proposed Hive (Distributed Hive) outperforms conventional Hive, reducing the total execution time of correlated TPC-H queries by 20-50%, depending on the number of queries and the percentage of shared tasks.
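The idea of executing common work only once can be illustrated with a small Python sketch. It is not the framework described above: the `Query` class, the `run_separately` and `run_shared` functions, and the toy `lineitem` rows are all hypothetical names and data chosen for illustration. The sketch only assumes that queries reading the same table can be grouped so that the table is scanned once for the whole group.

```python
from collections import defaultdict

# Toy in-memory stand-in for a table; rows and columns are illustrative only.
TABLES = {
    "lineitem": [
        {"l_quantity": 10, "l_discount": 0.05, "l_extendedprice": 100.0},
        {"l_quantity": 30, "l_discount": 0.00, "l_extendedprice": 250.0},
        {"l_quantity": 5, "l_discount": 0.10, "l_extendedprice": 40.0},
    ]
}


class Query:
    """Hypothetical filtered SUM query: SUM(column) over rows matching predicate."""

    def __init__(self, name, table, predicate, column):
        self.name = name
        self.table = table
        self.predicate = predicate
        self.column = column


def run_separately(queries, tables):
    """Baseline: every query performs its own full scan of its table."""
    results, rows_scanned = {}, 0
    for q in queries:
        total = 0.0
        for row in tables[q.table]:
            rows_scanned += 1
            if q.predicate(row):
                total += row[q.column]
        results[q.name] = total
    return results, rows_scanned


def run_shared(queries, tables):
    """Shared scan: group queries by table and scan each table only once."""
    by_table = defaultdict(list)
    for q in queries:
        by_table[q.table].append(q)

    results = {q.name: 0.0 for q in queries}
    rows_scanned = 0
    for table, group in by_table.items():
        for row in tables[table]:
            rows_scanned += 1
            # Each row read from storage is offered to every query in the group.
            for q in group:
                if q.predicate(row):
                    results[q.name] += row[q.column]
    return results, rows_scanned


QUERIES = [
    Query("q1", "lineitem", lambda r: r["l_quantity"] < 20, "l_extendedprice"),
    Query("q2", "lineitem", lambda r: r["l_discount"] > 0, "l_quantity"),
]

if __name__ == "__main__":
    print(run_separately(QUERIES, TABLES))
    print(run_shared(QUERIES, TABLES))
```

Both strategies return the same per-query results, but the shared version reads each table once rather than once per query. In a real Hadoop/Hive deployment the saving would come from avoided MapReduce jobs and I/O, not from a Python loop, so the row count here is only a proxy for that cost.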
