Cloud computing platforms are attracting growing attention as a new approach to data management, and several cloud products now offer a variety of services. However, most cloud platforms are not designed for structured data management, so they rarely support SQL queries directly. Even where a platform does support SQL, its underlying layer is a traditional relational database, so the cost of executing a subquery in the RDBMS can affect overall query performance. Improving query efficiency in cloud data management systems, especially for queries on structured data, has therefore become an increasingly important problem. To address it, we propose an efficient algorithm for query processing on structured data. Our approach is inspired by MapReduce, in which a job is divided into several tasks. Based on the distributed storage of a table, the algorithm divides a user query into subqueries; because the data is replicated in the cloud, each subquery is in turn mapped to k+1 subqueries. Every subquery waits in the queue of the slave that stores the data it queries. To balance the load, the algorithm uses two scheduling strategies to dispatch subqueries. In addition, to shorten the client's waiting time, we adopt a pipeline strategy for returning results. Finally, we demonstrate the efficiency and scalability of the algorithm through a range of experiments. The approach is general, independent of the underlying infrastructure, and can be readily implemented on various cloud computing platforms.
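The flow described above (split by partition, k+1 replica candidates, per-slave queues, load-aware dispatch, pipelined results) can be illustrated with a minimal Python sketch. It is not the paper's implementation. The names `Slave`, `place_replicas`, `dispatch`, and `run_pipelined` are hypothetical, the round-robin replica placement is an assumption, and because the abstract does not name the two scheduling strategies, the sketch assumes a single shortest-queue rule and reads "k+1 subqueries" as k+1 candidate copies of which one is executed.

```python
from collections import deque
from typing import Dict, Iterator, List


class Slave:
    """Hypothetical worker node: holds partition replicas and a FIFO subquery queue."""

    def __init__(self, name: str):
        self.name = name
        self.partitions: Dict[int, List[dict]] = {}
        self.queue: deque = deque()


def place_replicas(partitions, slaves, k):
    """Store each partition on k+1 distinct slaves (round-robin placement is an assumption)."""
    if k + 1 > len(slaves):
        raise ValueError("need at least k+1 slaves")
    for pid, rows in enumerate(partitions):
        for r in range(k + 1):
            slaves[(pid + r) % len(slaves)].partitions[pid] = rows


def dispatch(slaves, num_partitions, predicate):
    """Create one subquery per partition and queue it on the least-loaded replica holder."""
    for pid in range(num_partitions):
        # The k+1 slaves holding a copy of this partition are the candidates.
        candidates = [s for s in slaves if pid in s.partitions]
        # Assumed strategy: pick the candidate with the shortest queue.
        target = min(candidates, key=lambda s: len(s.queue))
        target.queue.append((pid, predicate))


def run_pipelined(slaves) -> Iterator[List[dict]]:
    """Drain the queues round-robin, yielding each subquery's rows as soon as it finishes."""
    while any(s.queue for s in slaves):
        for s in slaves:
            if s.queue:
                pid, predicate = s.queue.popleft()
                yield [row for row in s.partitions[pid] if predicate(row)]


# Usage: 3 partitions of 4 rows each, 3 slaves, k = 1 extra replica per partition.
slaves = [Slave(f"s{i}") for i in range(3)]
partitions = [[{"id": i, "v": i * 10} for i in range(p * 4, p * 4 + 4)] for p in range(3)]
place_replicas(partitions, slaves, k=1)
dispatch(slaves, len(partitions), lambda row: row["v"] >= 50)
for batch in run_pipelined(slaves):
    print(batch)  # the client sees partial results before the whole query completes
```

Returning results through a generator mirrors the pipeline idea: the client consumes each batch as soon as one subquery finishes instead of waiting for all of them.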