With the explosion of data over the past decade, big data has become a research hotspot in the information field. Many cloud-based distributed data processing platforms, such as Hadoop, Hive, and Pig, have been proposed to provide efficient and cost-effective solutions for big data query processing. However, most current research focuses on improving query processing performance from a system-level perspective and does not consider the characteristics of the queries themselves, such as query similarity. Ignoring these characteristics causes a large amount of redundant computation, reduces query execution efficiency, and thus limits the performance of multi-query processing. To solve this problem, we propose Multi-Q, a multi-query optimization framework for MapReduce-oriented cloud environments that exploits the dependence between multiple queries to reuse query results. First, a cluster-based partition algorithm, called CPA, logically partitions the search range of the query workload. Second, a multi-query reuse dependence graph (MRDG) is constructed from the cluster-based partition results to describe the dependence between the queries. Finally, a Multi-Q processing algorithm based on the MRDG reuses query results to improve overall query processing performance. We evaluate our approach by deploying Multi-Q on Hadoop in a real cloud environment, called SEU-Cloud, and conducting extensive experiments on the standard TPC-H benchmark. The results show that Multi-Q improves performance by approximately 39.3% compared with Hive.
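To make the idea of result reuse concrete, the following is a minimal sketch, not the CPA or MRDG algorithms of the paper, whose details are not given here. It assumes every query is an inclusive range filter on a single integer attribute, and that a query can reuse another query's result whenever its range is contained in that query's range. The names `RangeQuery`, `build_reuse_graph`, and `run_workload` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeQuery:
    """Hypothetical query: select values v with low <= v <= high."""
    name: str
    low: int
    high: int

    def contains(self, other):
        # True if every value matched by `other` is also matched by self.
        return self.low <= other.low and other.high <= self.high

    def width(self):
        return self.high - self.low


def build_reuse_graph(queries):
    """Return (ordered queries, parent map).

    Queries are ordered widest first. Each query's parent is the narrowest
    earlier query whose range contains it, or None if it must scan the data.
    Only earlier queries are candidates, so the graph has no cycles.
    """
    ordered = sorted(queries, key=lambda q: q.width(), reverse=True)
    parent = {}
    for i, q in enumerate(ordered):
        candidates = [p for p in ordered[:i] if p.contains(q)]
        if candidates:
            parent[q.name] = min(candidates, key=lambda p: p.width()).name
        else:
            parent[q.name] = None
    return ordered, parent


def run_workload(rows, queries):
    """Evaluate all queries, filtering a parent's result where possible."""
    ordered, parent = build_reuse_graph(queries)
    results = {}
    scanned = {}  # number of rows each query had to examine
    for q in ordered:
        source = rows if parent[q.name] is None else results[parent[q.name]]
        scanned[q.name] = len(source)
        results[q.name] = [v for v in source if q.low <= v <= q.high]
    return results, parent, scanned


if __name__ == "__main__":
    rows = list(range(100))
    workload = [
        RangeQuery("A", 0, 50),
        RangeQuery("B", 10, 20),
        RangeQuery("C", 12, 15),
        RangeQuery("D", 60, 80),
    ]
    results, parent, scanned = run_workload(rows, workload)
    print(parent)
    print(scanned)
```

In this toy workload, queries B and C filter the already materialized results of A and B instead of rescanning the full data set, which is the kind of redundant computation the framework aims to avoid. A real system would also have to handle multi-attribute predicates, aggregates, and the cost of materializing intermediate results, none of which this sketch models.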