Optimization of Data Distribution Strategy in Theta-join Process based on Spark

The theta-join between tables is a common operation in data querying and statistical analysis. On large datasets it is expensive: in a distributed environment, a theta-join inevitably incurs heavy computation and communication overhead. The diversity of the data also gives rise to data skew. To address uneven data distribution and data skew in theta-join processing, we propose a solution that improves the data filtering strategy and introduces a data distribution method based on quantified factors that affect join efficiency. The solution is implemented on the distributed computing framework Spark. Experimental results show that the method applies to many types of data and achieves better performance.
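
For readers unfamiliar with the operation, a theta-join pairs each tuple of one table with every tuple of the other that satisfies an arbitrary comparison predicate, not only equality. The following is a minimal single-machine sketch in Python, given purely for illustration; it is not the Spark implementation described in this paper, and the names `theta_join`, `R`, and `S` are hypothetical. It also shows why the cost grows quickly: a naive evaluation compares every pair of tuples.

```python
from itertools import product
from operator import lt


def theta_join(left, right, theta):
    """Return every (l, r) pair for which the predicate theta(l, r) holds.

    This is the naive definition: a filtered cross product.
    """
    return [(l, r) for l, r in product(left, right) if theta(l, r)]


# Hypothetical toy relations: single-column tables of integers.
R = [1, 5, 9]
S = [4, 6]

# Join condition: R.value < S.value (an inequality, so not an equi-join).
pairs = theta_join(R, S, lt)

# A naive evaluation tests every combination of tuples.
comparisons = len(R) * len(S)

print(pairs)
print(comparisons)
```

Because the number of candidate pairs is the product of the table sizes, filtering tuples early and spreading the remaining pairs evenly across workers are the two natural levers for reducing cost in a distributed setting.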