Exploiting Bloom Filters for Efficient Joins in MapReduce

MapReduce is a programming model widely used for large-scale data analysis. Join operations, however, are inefficient in MapReduce because they produce large intermediate results even when only a small fraction of the input data participates in the join. We alleviate this problem by exploiting Bloom filters within a single MapReduce job: we create Bloom filters for one input dataset and use them in the map phase to filter out redundant records from the other input dataset. To do this, we modify the MapReduce framework in two ways. First, map tasks are scheduled according to the processing order of the input datasets. Second, Bloom filters are created dynamically in a distributed fashion. We propose two map task scheduling policies and provide a method for determining the processing order based on estimated cost. Our experimental results show that the proposed techniques decrease the size of intermediate results and can improve execution time.
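The filtering idea can be illustrated with a minimal single-process sketch in Python. It is not the modified MapReduce framework described above, and it does not model the scheduling policies or the cost-based ordering. The names `BloomFilter`, `build_filter`, `map_filtered`, and `reduce_join` are hypothetical, and the filter size and hash count are arbitrary assumptions. One filter is built per input split and the filters are merged with a bitwise OR, which stands in for distributed filter creation. The merged filter then drops records of the other dataset before they would reach the shuffle.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; num_bits and num_hashes are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

    def union(self, other):
        # Merging two filters of equal shape is a bitwise OR of their bits.
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))
        return merged


def build_filter(splits, num_bits=1024, num_hashes=3):
    """Build one local filter per input split, then merge them."""
    merged = BloomFilter(num_bits, num_hashes)
    for split in splits:
        local = BloomFilter(num_bits, num_hashes)
        for key, _value in split:
            local.add(key)
        merged = merged.union(local)
    return merged


def map_filtered(records, bloom):
    """Map-side filter: keep only records whose join key may have a match."""
    return [(k, v) for k, v in records if bloom.might_contain(k)]


def reduce_join(left, right):
    """Exact equi-join on the key; removes any Bloom filter false positives."""
    by_key = {}
    for k, v in left:
        by_key.setdefault(k, []).append(v)
    return [(k, lv, rv) for k, rv in right for lv in by_key.get(k, [])]
```

Because a Bloom filter never reports a false negative, every record that participates in the join survives the map-side filter. False positives only add some extra intermediate records, which the exact key match in the reduce step discards.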
