An Efficient Join Query Processing Based on MJR Framework

Large-scale data analysis is an important topic in cloud computing, and it often requires complex operations such as theta-join, which covers both equi-join and non-equi-join. MapReduce is a programming framework for running such analysis in parallel in the cloud. To improve MapReduce performance on complex data analysis, researchers proposed the Map-Join-Reduce API, which supports the equi-join operation. The method proposed here extends the Map-Join-Reduce framework so that it also supports non-equi-join. It rests on three concepts. First, the data are filtered according to the query statement. Second, the filtered data are sent to the corresponding workers according to the join expression, which raises the level of parallelism; each worker performs its join operation once it has received the filtered data. Finally, we aggregate the results using the aggregate functions specified in the select clause.
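The three stages described above (filter, distribute and join, aggregate) can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions, not the Map-Join-Reduce API or the implementation evaluated in the paper: the relation names, the sample rows, the helper functions, and the round-robin distribution with a broadcast right-hand relation are all hypothetical, and the "workers" are simulated by a plain loop.

```python
from collections import defaultdict

# Hypothetical sample relations (not taken from the paper).
ORDERS = [
    {"id": 1, "region": "EU", "amount": 50},
    {"id": 2, "region": "EU", "amount": 150},
    {"id": 3, "region": "US", "amount": 500},
    {"id": 4, "region": "EU", "amount": 20},
]
LIMITS = [
    {"name": "low", "threshold": 30, "active": True},
    {"name": "high", "threshold": 100, "active": True},
    {"name": "old", "threshold": 0, "active": False},
]


def filter_rows(rows, predicate):
    """Stage 1: keep only the rows that satisfy the query's selection predicate."""
    return [row for row in rows if predicate(row)]


def partition(rows, num_workers):
    """Split rows round-robin into one chunk per simulated worker (an assumption)."""
    chunks = [[] for _ in range(num_workers)]
    for index, row in enumerate(rows):
        chunks[index % num_workers].append(row)
    return chunks


def theta_join(left, right, theta):
    """Nested-loop join: emit every (l, r) pair for which theta(l, r) holds."""
    return [(l, r) for l in left for r in right if theta(l, r)]


def run_query(orders, limits, num_workers=3):
    # Stage 1: filter both relations before any data is moved.
    eu_orders = filter_rows(orders, lambda o: o["region"] == "EU")
    active_limits = filter_rows(limits, lambda l: l["active"])

    # Stage 2: each simulated worker gets one chunk of orders plus a full
    # copy of the (small) limits relation, then evaluates the non-equi-join
    # condition amount > threshold locally.
    joined = []
    for chunk in partition(eu_orders, num_workers):
        joined.extend(
            theta_join(chunk, active_limits,
                       lambda o, l: o["amount"] > l["threshold"])
        )

    # Stage 3: aggregate the joined pairs, here COUNT(*) grouped by limit name.
    counts = defaultdict(int)
    for _order, limit in joined:
        counts[limit["name"]] += 1
    return dict(counts)


if __name__ == "__main__":
    print(run_query(ORDERS, LIMITS))
```

Because every chunk sees the whole filtered right-hand relation, the result does not depend on the number of workers; only the amount of work per worker changes. A real deployment would pick the distribution scheme from the join expression rather than use a fixed round-robin split.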
