Paper Information - Optimization of Data Distribution Strategy in Theta-join Process based on Spark

The theta-join between tables is a common operation in data querying and statistical analysis, and it becomes costly on large volumes of data. In a distributed environment, a theta-join inevitably incurs heavy computation and communication overhead. The diversity of the data also gives rise to data skew. To address uneven data distribution in the theta-join and data skew during processing, we propose a solution that improves the data filtering strategy and introduces a data distribution method based on factors affecting join efficiency, which we quantify. The solution is implemented on the distributed computing framework Spark. Experimental results show that the method applies to many types of data and achieves better performance.

Meina Song | E. Haihong | Ken Zhang | Shijiu Cao

[1] Mostafa Bamha, et al. Towards Scalability and Data Skew Handling in GroupBy-Joins using MapReduce Model, 2015, ICCS.

[2] Wenjie Liu, et al. An Efficient Filter Strategy for Theta-Join Query in Distributed Environment, 2017, 2017 46th International Conference on Parallel Processing Workshops (ICPPW).

[3] In-Hak Joo. Spatial Big Data Query Processing System Supporting SQL-based Query Language in Hadoop, 2017.

[4] Sang-goo Lee, et al. Handling data skew in join algorithms using MapReduce, 2016, Expert Syst. Appl.

[5] Wenhong Tian, et al. A Comparative Study of Data Skew in Hadoop, 2017, ICNCC 2017.

[6] Mirek Riedewald, et al. Processing theta-joins using MapReduce, 2011, SIGMOD '11.

[7] Jing Li, et al. Optimizing Theta-Joins in a MapReduce Environment, 2013.

[8] Hyoung-Joo Kim, et al. Join processing using Bloom filter in MapReduce, 2012, RACS.