Big Data Equi-Join Optimization Algorithms on Spark Cloud Computing Platform

On the Spark cloud computing platform, conventional big data equi-join algorithms are time-consuming and fall short of performance requirements, which makes efficient equi-join processing a pressing challenge. To address this, we propose the Compressed Bloom Filter Join algorithm, which constructs a static one-dimensional bit array and uses it to filter out most invalid join candidates (records that cannot satisfy the join condition), thereby reducing network overhead and improving join performance. We further propose the Compressed Bloom Filter Join Extension algorithm, an extension of Compressed Bloom Filter Join that builds a dynamic two-dimensional bit array to filter out invalid records and further accelerates the join when the data size is not known in advance. Experimental results show that both algorithms reduce running time and the amount of data in the Shuffle stage, and that they outperform Hash Join and Broadcast Join on the Spark cloud computing platform.
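The filtering idea behind a Bloom-filter-based join can be illustrated with a minimal, single-machine sketch. It is not the Compressed Bloom Filter Join algorithm itself: it uses a plain (uncompressed) Bloom filter over a static one-dimensional bit array, runs in pure Python rather than on Spark, and all names (`BloomFilter`, `bloom_filter_join`, `num_bits`, `num_hashes`) are hypothetical choices made for illustration. The sketch assumes the smaller table's keys are inserted into the filter, and the larger table is pre-filtered against it before the actual join, which is the step that would reduce shuffled data in a distributed setting.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over a static one-dimensional bit array (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def bloom_filter_join(small, large, num_bits=1024, num_hashes=3):
    """Equi-join two lists of (key, value) pairs, pre-filtering `large`.

    Returns the joined (key, small_value, large_value) tuples and the number
    of `large` records that survived the filter (a stand-in for shuffle size).
    """
    bloom = BloomFilter(num_bits, num_hashes)
    for key, _ in small:
        bloom.add(key)

    # Drop records whose key cannot match; false positives may slip through.
    survivors = [(k, v) for k, v in large if bloom.might_contain(k)]

    # Ordinary hash join on the surviving records removes any false positives.
    table = {}
    for k, v in small:
        table.setdefault(k, []).append(v)
    joined = [(k, sv, lv) for k, lv in survivors for sv in table.get(k, [])]
    return joined, len(survivors)


if __name__ == "__main__":
    small = [(1, "a"), (2, "b")]
    large = [(2, "x"), (3, "y"), (2, "z"), (4, "w")]
    joined, kept = bloom_filter_join(small, large)
    print(joined, kept)
```

Because a Bloom filter never produces false negatives, the pre-filter cannot drop a record that belongs in the result; it can only let through some non-matching records, which the final hash join discards. The bit-array size and number of hash functions trade memory against that false-positive rate.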
