Application of Filters to Multiway Joins in MapReduce

Joining multiple datasets in MapReduce can amplify disk and network overhead, because either intermediate join results have to be written to the underlying distributed file system or map output records have to be replicated multiple times. This paper proposes a method for applying filters based on the processing order of the input datasets. The method suits both types of multiway joins: common attribute joins and distinct attribute joins. The number of redundant records filtered out depends on the processing order. In common attribute joins, the input records do not need to be replicated, so a single set of filters is created and the filters are applied in turn. In distinct attribute joins, the input records have to be replicated, so multiple sets of filters need to be created, and the number of sets depends on the number of join attributes. The experimental results showed that our approach outperformed both a cascade of two-way joins and basic multiway joins in cases where only small portions of the input datasets were joined.
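
The idea of applying filters in turn for a common attribute join can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the `BloomFilter` class, the `cascade_filter` function, the filter sizes, and the choice of a Bloom filter as the filter type are all assumptions made for illustration. Each dataset is probed against the filter built from the previous dataset, and only the surviving join keys are added to the next filter, so later datasets in the processing order are filtered more aggressively.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def cascade_filter(datasets, key_of=lambda record: record[0]):
    """Filter datasets in processing order on a shared join key (hypothetical helper).

    Dataset i is probed against the filter built from the survivors of
    dataset i-1; its own survivors populate the filter for dataset i+1.
    """
    previous = None  # no filter exists before the first dataset
    survivors = []
    for records in datasets:
        current = BloomFilter()
        kept = []
        for record in records:
            key = key_of(record)
            if previous is None or key in previous:
                current.add(key)
                kept.append(record)
        survivors.append(kept)
        previous = current
    return survivors


if __name__ == "__main__":
    r = [(1, "a"), (2, "b"), (3, "c")]
    s = [(2, "x"), (3, "y"), (4, "z")]
    t = [(3, "p"), (5, "q")]
    for name, kept in zip("RST", cascade_filter([r, s, t])):
        print(name, kept)
```

Because a Bloom filter admits false positives, a few non-joining records may survive and must still be eliminated by the join itself; records that do join are never dropped. In this sketch the first dataset is not filtered at all, which is one reason the processing order affects how many redundant records are removed.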
