Semi-join computation on distributed file systems using map-reduce-merge model

The semi-join is the most widely used technique for optimizing the processing of complex relational queries on distributed architectures. However, the overhead of semi-join computation can be very high because of data skew and the high cost of communication in such architectures. Internet search engines need to process vast amounts of raw data every day. Hence, systems that manage such data should ensure scalability, reliability and availability while keeping query processing time reasonable. Hadoop and Google's File System are examples of such systems. In this paper, we present a new algorithm, based on the Map-Reduce-Merge model and distributed histograms, for processing semi-join operations on such systems. A cost analysis of this algorithm shows that our approach is insensitive to data skew while reducing communication and disk input/output costs to a minimum.
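The general idea of a histogram-based semi-join can be illustrated with a minimal in-memory sketch. This is not the algorithm presented in the paper: the function names (local_histogram, global_histogram, semi_join), the representation of partitions as Python lists, and the single-process execution are all assumptions made for illustration. Each partition first builds a local histogram of its join-attribute values (a map-like step), the local histograms are summed into a global one (a reduce-like step), and only tuples of R whose join value also occurs in S are kept (a merge-like step). Because only histograms are combined, and not the tuples themselves, a heavily repeated join value adds a single histogram entry rather than many transferred tuples.

```python
from collections import Counter


def local_histogram(partition, key):
    """Map-like step: count join-attribute values within one partition."""
    return Counter(key(t) for t in partition)


def global_histogram(local_hists):
    """Reduce-like step: sum the local histograms into a global one."""
    total = Counter()
    for hist in local_hists:
        total.update(hist)
    return total


def semi_join(r_parts, s_parts, r_key, s_key):
    """Merge-like step: keep tuples of R whose join value occurs in S.

    r_parts and s_parts are lists of partitions (lists of tuples); this
    layout is a hypothetical stand-in for data spread over several nodes.
    """
    hist_r = global_histogram(local_histogram(p, r_key) for p in r_parts)
    hist_s = global_histogram(local_histogram(p, s_key) for p in s_parts)
    # Only join values present in both relations can contribute to the result.
    matching = hist_r.keys() & hist_s.keys()
    return [[t for t in part if r_key(t) in matching] for part in r_parts]


if __name__ == "__main__":
    r = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
    s = [[(1, "x")], [(3, "y"), (4, "z")]]
    first = lambda t: t[0]  # join attribute is the first field
    print(semi_join(r, s, first, first))
```

In this sketch the result keeps the original partitioning of R, so each partition is filtered in place and no tuple of R is moved.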
