Paper information - Load balancing and skew resilience for parallel joins

Load balancing and skew resilience for parallel joins

We address the problem of load balancing for parallel joins. We show that the distributions of both the input data received and the output data produced by worker machines are important for performance. As a result, previous work, which optimizes for either input or output alone, is ineffective for load balancing. To that end, we propose a multi-stage load-balancing algorithm which considers the properties of both input and output data through sampling of the original join matrix. To do this efficiently, we propose a novel category of equi-weight histograms. To build them, we exploit state-of-the-art computational geometry algorithms for rectangle tiling. To our knowledge, we are the first to employ tiling algorithms for join load-balancing. In addition, we propose a novel, join-specialized tiling algorithm that has drastically lower time and space complexity than existing algorithms. Experiments show that our scheme outperforms state-of-the-art techniques by up to a factor of 15.
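To make concrete the claim that both input and output matter, here is a minimal Python sketch. It is not the paper's algorithm: it uses no sampling and no tiling, and simply cuts the join matrix of a toy equi-join into a fixed grid of rectangular regions. The function name `region_weights`, the cost factors `w_in` and `w_out`, and the cut positions are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def region_weights(r_keys, s_keys, row_cuts, col_cuts, w_in=1.0, w_out=1.0):
    """Weight of each rectangular region of the R x S equi-join matrix.

    Rows are the sorted R keys and columns are the sorted S keys.
    row_cuts / col_cuts are positions where the sorted lists are split.
    Weight = w_in * (input tuples) + w_out * (output tuples); this cost
    model is an assumption made for this sketch.
    """
    r_sorted, s_sorted = sorted(r_keys), sorted(s_keys)
    row_bounds = [0, *row_cuts, len(r_sorted)]
    col_bounds = [0, *col_cuts, len(s_sorted)]
    weights = {}
    for i in range(len(row_bounds) - 1):
        rows = r_sorted[row_bounds[i]:row_bounds[i + 1]]
        row_counts = Counter(rows)
        for j in range(len(col_bounds) - 1):
            cols = s_sorted[col_bounds[j]:col_bounds[j + 1]]
            n_in = len(rows) + len(cols)              # tuples the region receives
            n_out = sum(row_counts[k] for k in cols)  # matching pairs it produces
            weights[(i, j)] = w_in * n_in + w_out * n_out
    return weights


# Toy data: key 1 is heavily skewed in R.
r = [1, 1, 1, 1, 2, 3]
s = [1, 1, 2, 4]

input_only = region_weights(r, s, row_cuts=[3], col_cuts=[2], w_out=0.0)
combined = region_weights(r, s, row_cuts=[3], col_cuts=[2])

print(input_only)  # every region receives 5 input tuples
print(combined)    # region (0, 0) is heaviest once output is counted
```

In this toy case, an input-only metric rates the four regions as perfectly balanced, while counting output shows that region (0, 0) carries roughly twice the load of region (0, 1). An equi-weight partitioning in the sense of the abstract would instead choose region boundaries so that these combined weights are close to equal.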
