Supporting Streaming Updates in an Active Data Warehouse

Active data warehousing has emerged as an alternative to conventional warehousing practices in order to meet the high demand of applications for up-to-date information. In a nutshell, an active warehouse is refreshed on-line and thus achieves a higher consistency between the stored information and the latest data updates. The need for on-line warehouse refreshment introduces several challenges in the implementation of data warehouse transformations, with respect to their execution time and their overhead to the warehouse processes. In this paper, we focus on a frequently encountered operation in this context, namely, the join of a fast stream S of source updates with a disk-based relation R, under the constraint of limited memory. This operation lies at the core of several common transformations, such as, surrogate key assignment, duplicate detection or identification of newly inserted tuples. We propose a specialized join algorithm, termed mesh join (MeshJoin), that compensates for the difference in the access cost of the two join inputs by (a) relying entirely on fast sequential scans of R, and (b) sharing the I/O cost of accessing R across multiple tuples of S. We detail the Mesh Join algorithm and develop a systematic cost model that enables the tuning of Mesh Join for two objectives: maximizing throughput under a specific memory budget or minimizing memory consumption for a specific throughput. We present an experimental study that validates the performance of Mesh Join on synthetic and real-life data. Our results verify the scalability of Mesh-Join to fast streams and large relations, and demonstrate its numerous advantages over existing join algorithms.

[1]  Stephen G. Warren,et al.  Edited synoptic cloud reports from ships and land stations over the globe , 1996 .

[2]  Goetz Graefe,et al.  Query evaluation techniques for large databases , 1993, CSUR.

[3]  Jennifer Widom,et al.  Performance Issues in Incremental Warehouse Maintenance , 2000, VLDB.

[4]  Jennifer Widom,et al.  Models and issues in data stream systems , 2002, PODS.

[5]  Lukasz Golab,et al.  Issues in data stream management , 2003, SGMD.

[6]  Douglas B. Terry,et al.  Continuous queries over append-only databases , 1992, SIGMOD '92.

[7]  Walid G. Aref,et al.  Scheduling for shared window joins over data streams , 2003, VLDB.

[8]  Evaggelia Pitoura,et al.  ETL queues for active data warehousing , 2005, IQIS '05.

[9]  Inderpal Singh Mumick,et al.  Incremental maintenance of aggregate and outerjoin expressions , 2006, Inf. Syst..

[10]  Michael J. Franklin,et al.  PSoup: a system for streaming queries over streaming data , 2003, The VLDB Journal.

[11]  Elke A. Rundensteiner,et al.  Integrating the maintenance and synchronization of data warehouses using a cooperative framework , 2002, Inf. Syst..

[12]  Jennifer Widom,et al.  Continuous queries over data streams , 2001, SGMD.

[13]  Bernhard Seeger,et al.  Progressive Merge Join: A Generic and Non-blocking Sort-based Join Algorithm , 2002, VLDB.

[14]  Surajit Chaudhuri,et al.  Maintenance of Materialized Views: Problems, Techniques, and Applications. , 1995 .

[15]  Hector Garcia-Molina,et al.  Efficient resumption of interrupted warehouse loads , 2000, SIGMOD '00.

[16]  Lukasz Golab,et al.  Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams , 2003, VLDB.

[17]  Jennifer Widom,et al.  View maintenance in a warehousing environment , 1995, SIGMOD '95.

[18]  Jeffrey F. Naughton,et al.  Maximizing the Output Rate of Multi-Way Join Queries over Streaming Information Sources , 2003, VLDB.

[19]  Yufei Tao,et al.  RPJ: producing fast join results on streams through rate-based optimization , 2005, SIGMOD '05.

[20]  Hector Garcia-Molina,et al.  Efficient Snapshot Differential Algorithms for Data Warehousing , 1996, VLDB.