Comparing Global Optimization and Default Settings of Stream-Based Joins - (Experimental Paper)

One problem encountered in real-time data integration is the join of a continuous incoming data stream with a disk-based relation. In this paper we investigate a stream-based join algorithm called mesh join (MESHJOIN) and focus on a critical component of the algorithm, the disk-buffer. In MESHJOIN the size of the disk-buffer varies with the total memory budget, and tuning is required to obtain the maximum service rate within the limited available memory. Until now there has been little data on how the position of the optimum disk-buffer size depends on the memory size, and no performance comparison has been carried out between the optimum and reasonable default sizes for the disk-buffer. To avoid tuning, we propose a reasonable default value for the disk-buffer size that incurs only a small and acceptable performance loss. The experimental results validate our arguments.
