Resource optimization for processing of stream data in data warehouse environment

To meet the growing business demand for up-to-date information, current data integration approaches are moving towards real-time updates. In real-time data integration, updates occurring on the source systems need to be reflected in the data warehouse immediately. One important element of real-time data integration is the join of a continuous incoming data stream with disk-based master data. In this context, a stream-based algorithm called X-HYBRIDJOIN (Extended Hybrid Join) was proposed earlier, with favorable asymptotic runtime behavior. However, its absolute performance was not as good as hoped. In this paper we present results showing that, with proper tuning, the resulting Tuned X-HYBRIDJOIN performs significantly better than the previous X-HYBRIDJOIN, and better than other applicable join operators found in the literature. We present the tuning approach, which is based on measurement techniques and a revised cost model. To evaluate the algorithm's performance, we conduct an experimental study showing that Tuned X-HYBRIDJOIN exhibits the desired performance characteristics.
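To make the central operation concrete, the following is a minimal sketch of a generic join between an incoming stream and paged master data. It is not X-HYBRIDJOIN or its tuned variant, whose details are not given here. The function name `stream_master_join`, the `batch_size` parameter, and the use of in-memory dicts to stand in for disk pages are all assumptions made for illustration.

```python
from collections import defaultdict


def stream_master_join(stream, master_pages, batch_size=4):
    """Join stream tuples (key, payload) with master data split into pages.

    Hypothetical illustration only: each dict in master_pages stands in
    for one disk page of master data, mapping key -> master row.
    Yields (key, payload, master_row) for every stream tuple with a match.
    """
    pending = defaultdict(list)  # hash table of buffered stream tuples
    buffered = 0

    def probe():
        # One pass over the "disk" pages, probing the in-memory hash table.
        for page in master_pages:
            for key, row in page.items():
                for payload in pending.get(key, ()):
                    yield key, payload, row
        # Tuples without a matching master row are discarded here.
        pending.clear()

    for key, payload in stream:
        pending[key].append(payload)
        buffered += 1
        if buffered == batch_size:
            yield from probe()
            buffered = 0

    if pending:
        # Flush the final, partially filled batch.
        yield from probe()


if __name__ == "__main__":
    pages = [{1: "a", 2: "b"}, {3: "c"}]
    updates = [(1, "x"), (3, "y"), (9, "z"), (2, "w"), (1, "v")]
    for joined in stream_master_join(updates, pages, batch_size=2):
        print(joined)
```

Buffering stream tuples in a hash table and amortizing each pass over the master data across a whole batch is the basic trade-off in this kind of join: larger batches mean fewer disk passes per tuple but higher latency and memory use. Balancing such resources is presumably what a cost model would address, though the sketch above does not attempt it.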
