Multi-Q: Multiple Queries Optimization Based on MapReduce in Cloud

With the explosion of data over the past decade, big data has become a research hotspot in the information field. Many cloud-based distributed data processing platforms, such as Hadoop, Hive, and Pig, have been proposed to provide efficient and cost-effective solutions for big data query processing. However, most current research focuses on improving query processing performance from a system-level perspective and does not consider the characteristics of the queries themselves, such as query similarity. Ignoring these characteristics causes a large amount of redundant computation and reduces query execution efficiency, which in turn limits the performance of multi-query processing. To solve this problem, we propose Multi-Q, a multi-query optimization framework for MapReduce-oriented cloud environments, which exploits the dependence between multiple queries to reuse query results. First, a cluster-based partition algorithm (CPA) logically partitions the search range of the query workload. Second, a multi-queries reuse dependence graph (MRDG) is constructed from the cluster-based partition results to describe the dependence between the queries. Finally, a Multi-Q processing algorithm based on the MRDG reuses query results and improves overall query processing performance. We evaluate our approach by deploying Multi-Q on Hadoop in a real cloud environment, called SEU-Cloud, and conducting extensive experiments based on the standard TPC-H benchmark. The results show that, compared with Hive, Multi-Q improves performance by approximately 39.3%.
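The idea of reusing results along a dependence graph can be illustrated with a minimal sketch. It is not the CPA or MRDG algorithm from the paper: it assumes, purely for illustration, that each query is an inclusive range filter on a single integer attribute and that one query depends on another when its range is contained in the other's. The names `RangeQuery`, `build_reuse_graph`, and `run_workload` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeQuery:
    """Hypothetical query: select rows with low <= value <= high."""
    name: str
    low: int
    high: int


def width(q):
    return q.high - q.low


def contains(outer, inner):
    return outer.low <= inner.low and inner.high <= outer.high


def build_reuse_graph(queries):
    """Map each query name to the narrowest strictly wider query that
    contains it, or None if it must scan the base data.
    Assumption: dependence is range containment, not the paper's MRDG rule."""
    parent = {}
    for q in queries:
        candidates = [p for p in queries
                      if contains(p, q) and width(p) > width(q)]
        parent[q.name] = min(candidates, key=width).name if candidates else None
    return parent


def run_workload(rows, queries):
    """Answer every query, reading from a parent's result when one exists.
    Returns the results and the number of rows each query had to scan."""
    parent = build_reuse_graph(queries)
    results, scanned = {}, {}
    # Widest first, so a parent's result is always available before its children.
    for q in sorted(queries, key=width, reverse=True):
        p = parent[q.name]
        source = rows if p is None else results[p]
        scanned[q.name] = len(source)
        results[q.name] = [r for r in source if q.low <= r <= q.high]
    return results, scanned


rows = list(range(100))
queries = [
    RangeQuery("q1", 0, 49),
    RangeQuery("q2", 10, 19),
    RangeQuery("q3", 12, 15),
    RangeQuery("q4", 60, 70),
]
results, scanned = run_workload(rows, queries)
print(build_reuse_graph(queries))
print(scanned)
```

In this toy workload, q2 reads q1's result and q3 reads q2's result instead of rescanning all 100 rows, which is the kind of redundant computation that result reuse is meant to avoid.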