Adaptive Load Diffusion for Multiway Windowed Stream Joins

In this paper, we present an adaptive load diffusion operator that enables scalable processing of multiway windowed stream joins (MWSJs) on a cluster system. Load diffusion is achieved by a set of novel semantics-preserving tuple routing algorithms. Unlike previous work, the load diffusion operator can (1) preserve the MWSJ semantics while spreading tuples to different hosts for parallel join processing; (2) achieve fine-grained load balancing among distributed hosts; and (3) perform semantics-preserving online adaptations to maintain optimal performance in dynamic stream environments. We have implemented a prototype of the distributed MWSJ framework on top of the System S distributed stream processing system. Our experimental results, based on both real data streams and synthetic workloads, show that the load diffusion algorithms can efficiently scale up the performance of MWSJ processing with low overhead.
