RPK-table based efficient algorithm for join-aggregate query on MapReduce

Abstract: Join-aggregate is an important and widely used operation in database systems. However, processing join-aggregate queries is time-consuming in big data environments, especially on the MapReduce framework. There are two main bottlenecks: heavy I/O caused by temporary data, and high communication overhead between data nodes during query processing. To overcome these bottlenecks, we design a data structure called the Reference Primary Key table (RPK-table), which stores the primary key and foreign key relationships between tables. Based on this structure, we propose an improved algorithm for join-aggregate queries on the MapReduce framework. Experiments on the TPC-H dataset demonstrate that our algorithm outperforms existing methods in terms of communication cost and query response time.
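To make the problem concrete, the sketch below simulates a join-aggregate query (a join followed by a grouped SUM) in a map/reduce style in plain Python. It is only an illustration, not the paper's algorithm: the abstract does not describe the RPK-table's layout, so the `build_key_lookup` function, the toy `customers` and `orders` tables, and the choice to pre-aggregate in the map phase are all assumptions made here to show how a precomputed primary key to foreign key mapping could reduce the data sent between nodes.

```python
from collections import defaultdict

# Hypothetical toy tables (not taken from the paper or from TPC-H data).
customers = [  # (custkey [primary key], nation)
    (1, "FR"), (2, "DE"), (3, "FR"),
]
orders = [  # (orderkey, custkey [foreign key], totalprice)
    (10, 1, 100.0), (11, 2, 50.0), (12, 1, 25.0), (13, 3, 75.0),
]


def build_key_lookup(dim_rows):
    """Assumed stand-in for a key-reference table: primary key -> group attribute."""
    return {pk: group for pk, group in dim_rows}


def map_partial_sums(fact_split, lookup):
    """Map phase: resolve each foreign key locally and pre-aggregate per group."""
    partial = defaultdict(float)
    for _orderkey, fk, price in fact_split:
        group = lookup.get(fk)
        if group is not None:  # inner-join semantics: drop unmatched rows
            partial[group] += price
    return list(partial.items())


def reduce_sums(mapper_outputs):
    """Reduce phase: merge the per-split partial sums into final totals."""
    total = defaultdict(float)
    for output in mapper_outputs:
        for group, value in output:
            total[group] += value
    return dict(total)


lookup = build_key_lookup(customers)
splits = [orders[:2], orders[2:]]  # two simulated input splits
result = reduce_sums(map_partial_sums(split, lookup) for split in splits)
print(result)
```

In this sketch each simulated mapper emits one (group, partial sum) pair per group rather than one joined row per order, which is the kind of reduction in shuffled data that the abstract's communication-cost claim refers to. Whether the RPK-table achieves this in the same way is not stated in the abstract.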
