Paper information - RPK-table based efficient algorithm for join-aggregate query on MapReduce

Abstract Join-aggregate is an important and widely used operation in database systems. However, processing join-aggregate queries is time-consuming in big data environments, especially on the MapReduce framework. There are two main bottlenecks: heavy I/O caused by temporary data, and heavy communication overhead between data nodes during query processing. To overcome these bottlenecks, we design a data structure called the Reference Primary Key table (RPK-table), which stores the primary key and foreign key relationships between tables. Based on this structure, we propose an improved join-aggregate query algorithm on the MapReduce framework. Experiments on the TPC-H dataset demonstrate that our algorithm outperforms existing methods in terms of communication cost and query response time.
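To make the query type concrete, the following is a minimal Python sketch of a join-aggregate query evaluated in a map/reduce style, roughly equivalent to SELECT c_nation, SUM(o_totalprice) FROM orders JOIN customer ON o_custkey = c_custkey GROUP BY c_nation. It is not the paper's algorithm: the abstract does not describe the layout of the RPK-table, so the lookup built by `build_key_lookup` is only a hypothetical stand-in for a structure that relates foreign keys to primary keys. The table contents, function names, and the `num_splits` parameter are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy tables (hypothetical data, loosely following TPC-H column naming).
customers = [  # (c_custkey, c_nation)
    (1, "FRANCE"),
    (2, "JAPAN"),
    (3, "FRANCE"),
]
orders = [  # (o_orderkey, o_custkey, o_totalprice)
    (100, 1, 50.0),
    (101, 2, 20.0),
    (102, 3, 30.0),
    (103, 1, 10.0),
]


def build_key_lookup(dim_rows):
    """Map each primary key to the attribute the query groups by.

    Assumption: a stand-in for a key-relationship structure, not the
    RPK-table as defined in the paper.
    """
    return {pk: group_attr for pk, group_attr in dim_rows}


def map_phase(fact_rows, lookup):
    """Resolve the join via the lookup and pre-aggregate within one split."""
    partial = defaultdict(float)
    for _orderkey, custkey, price in fact_rows:
        group = lookup.get(custkey)
        if group is not None:  # inner join: rows without a match are dropped
            partial[group] += price
    return dict(partial)


def reduce_phase(partials):
    """Merge the per-split partial sums into final group totals."""
    total = defaultdict(float)
    for partial in partials:
        for group, value in partial.items():
            total[group] += value
    return dict(total)


def join_aggregate(fact_rows, dim_rows, num_splits=2):
    """Simulate the job on a single machine by slicing the fact table."""
    lookup = build_key_lookup(dim_rows)
    splits = [fact_rows[i::num_splits] for i in range(num_splits)]
    return reduce_phase(map_phase(split, lookup) for split in splits)


result = join_aggregate(orders, customers)
print(result)
```

Because each split emits only one partial sum per group rather than one record per joined row, this style of evaluation illustrates why avoiding intermediate join output can reduce both temporary I/O and data shuffled between nodes, the two bottlenecks the abstract names.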
