Flexible and Adaptive Stream Join Algorithm

Flexibility and self-adaptivity are important for real-time join processing in a parallel shared-nothing environment. Join-Matrix is a high-performance model for distributed stream joins that supports arbitrary join predicates. It is resilient to data skew because it routes tuples to cells at random, with each stream corresponding to one side of the matrix. The design of the matrix partitioning scheme is a determining factor in maximizing system throughput while economizing on computing resources. In this paper, we propose a novel flexible and adaptive scheme-partitioning algorithm for the stream join operator, which ensures high throughput with economical resource usage by allocating resources on demand. Specifically, a lightweight scheme generator, which requires a sample of each stream's volume and the processing resource quota of each physical machine, generates a join scheme; a migration plan generator then decides how to migrate data among machines so as to minimize migration cost while ensuring correctness. Extensive experiments on different kinds of join workloads show that the approach is highly competitive with baseline systems on benchmarks.
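
To make the join-matrix model concrete, the following is a minimal single-process sketch, not the algorithm proposed in the paper. It assumes a matrix of rows x cols cells in which a tuple of one stream (called R here) is sent to every cell of a randomly chosen row and a tuple of the other stream (S) to every cell of a randomly chosen column, so each R/S pair meets in exactly one cell regardless of the predicate. The names JoinMatrix and choose_scheme, the capacity parameter (a per-machine quota counted in stored tuples), and the brute-force search are all illustrative assumptions; the abstract does not specify how its scheme generator works.

```python
import random
from itertools import product


class JoinMatrix:
    """Toy join-matrix: R tuples map to rows, S tuples map to columns."""

    def __init__(self, rows, cols, predicate, seed=None):
        self.rows, self.cols = rows, cols
        self.predicate = predicate  # arbitrary (theta) join predicate
        self.rng = random.Random(seed)
        # Each cell stands in for one machine holding a share of both streams.
        self.cells = {
            cell: {"R": [], "S": []}
            for cell in product(range(rows), range(cols))
        }

    def insert(self, stream, tup):
        """Route one tuple, store it, and return the join results it triggers."""
        if stream == "R":
            i = self.rng.randrange(self.rows)  # random row, independent of key
            targets = [(i, j) for j in range(self.cols)]
        elif stream == "S":
            j = self.rng.randrange(self.cols)  # random column
            targets = [(i, j) for i in range(self.rows)]
        else:
            raise ValueError("stream must be 'R' or 'S'")

        results = []
        for cell in targets:
            store = self.cells[cell]
            # Probe the opposite stream's tuples stored in this cell.
            for other in store["S" if stream == "R" else "R"]:
                pair = (tup, other) if stream == "R" else (other, tup)
                if self.predicate(*pair):
                    results.append(pair)
            store[stream].append(tup)
        return results


def choose_scheme(r_size, s_size, capacity, max_cells=1024):
    """Hypothetical scheme generator: fewest cells whose per-cell load
    r_size / rows + s_size / cols fits within the capacity quota."""
    best = None
    for rows in range(1, max_cells + 1):
        for cols in range(1, max_cells // rows + 1):
            if r_size / rows + s_size / cols <= capacity:
                candidate = (rows * cols, rows, cols)
                if best is None or candidate < best:
                    best = candidate
                break  # more columns only adds cells for this row count
    if best is None:
        raise ValueError("no scheme fits within max_cells")
    return best[1], best[2]
```

Because routing is random rather than key-based, a heavily repeated join key does not concentrate on one cell, which is the source of the skew resilience noted above. The cost is replication: every R tuple is stored cols times and every S tuple rows times, which is why the shape of the matrix matters for resource usage.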
