Processing Exact Results for Windowed Stream Joins in a Memory-Limited System: A Disk-Based, Adaptive Approach

We consider the problem of computing exact results for sliding window joins over data streams with limited memory. Existing approaches either (1) handle memory limitations by shedding load, and therefore cannot provide exact or even highly accurate results for sliding window joins over streams with time-varying arrival rates, or (2) suffer from large I/O overhead due to random disk flushes and disk-to-disk stages of the stream join, which makes them inefficient for sliding window joins. We propose an Adaptive, Hash-partitioned Exact Window Join (AH-EWJ) algorithm that incorporates disk storage as an archive. The algorithm spills window data to disk periodically, refines the output by retrieving the disk-resident data, maximizes the output rate by managing the in-memory blocks, and continuously adjusts the memory allocated within the stream windows. The problem of managing the window blocks in memory, similar in nature to caching, captures both the temporal and the frequency-related properties of the stream arrivals. We also present a baseline algorithm, Rate-based Progressive Window Joins (RPWJ), which extends an existing algorithm to reduce disk I/O overhead while processing sliding window joins. Experimental results demonstrate the performance and effectiveness of the proposed algorithm.
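
The general idea of keeping a window join exact under a memory budget by spilling rather than shedding can be illustrated with a small sketch. This is not the AH-EWJ algorithm itself: the class `WindowSide`, the function `window_join`, the spill policy (oldest tuple of the largest partition), and the use of an in-memory dictionary as a stand-in for disk are all assumptions made for illustration. The sketch assumes tuples arrive in timestamp order and that a tuple leaves the window once its age reaches the window length.

```python
from collections import defaultdict, deque


class WindowSide:
    """One stream's sliding window, hash-partitioned, with a toy 'disk' archive.

    Hypothetical helper for illustration; `disk` is an in-memory stand-in
    for disk-resident partitions, not real I/O.
    """

    def __init__(self, window, num_partitions, mem_capacity):
        self.window = window              # window length in time units
        self.n = num_partitions
        self.mem_capacity = mem_capacity  # max tuples kept in memory
        self.mem = defaultdict(deque)     # partition -> deque of (ts, key, payload)
        self.disk = defaultdict(deque)    # spilled tuples, same layout
        self.mem_size = 0

    def _part(self, key):
        return hash(key) % self.n

    def expire(self, now):
        # Drop tuples that have left the window, both in memory and on "disk".
        for store in (self.mem, self.disk):
            for q in store.values():
                while q and q[0][0] <= now - self.window:
                    q.popleft()
                    if store is self.mem:
                        self.mem_size -= 1

    def insert(self, ts, key, payload):
        self.mem[self._part(key)].append((ts, key, payload))
        self.mem_size += 1
        while self.mem_size > self.mem_capacity:
            self._spill()

    def _spill(self):
        # Assumed policy: move the oldest tuple of the largest partition to "disk".
        p = max(self.mem, key=lambda k: len(self.mem[k]))
        self.disk[p].append(self.mem[p].popleft())
        self.mem_size -= 1

    def probe(self, key):
        # Consult the archive as well as memory so that no match is lost.
        p = self._part(key)
        return [t for store in (self.disk, self.mem)
                for t in store[p] if t[1] == key]


def window_join(events, window, num_partitions=4, mem_capacity=2):
    """Equi-join streams 'A' and 'B' over a time-based sliding window.

    events: iterable of (ts, stream, key, payload) in timestamp order.
    Returns a list of (payload_from_A, payload_from_B) pairs.
    """
    sides = {s: WindowSide(window, num_partitions, mem_capacity) for s in "AB"}
    out = []
    for ts, stream, key, payload in events:
        for side in sides.values():
            side.expire(ts)
        other = sides["B" if stream == "A" else "A"]
        for _, _, other_payload in other.probe(key):
            if stream == "A":
                out.append((payload, other_payload))
            else:
                out.append((other_payload, payload))
        sides[stream].insert(ts, key, payload)
    return out
```

Because every probe reads both the in-memory and the archived part of the matching partition, the join output in this sketch does not depend on `mem_capacity`; only the amount of (simulated) disk access does. A real system would batch spills into blocks and schedule the disk probes, which is where the I/O cost discussed above arises.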
