A generic front-stage for semi-stream processing

Recently, a number of semi-stream join algorithms have been published. The typical system setup for these consists of one fast stream input that has to be joined with a disk-based relation R. These semi-stream join approaches typically perform the join with a limited main memory partition assigned to them, which is generally not large enough to hold the whole relation R. We propose a caching approach that can be used as a front-stage for different semi-stream join algorithms, resulting in significant performance gains for common applications. We analyze our approach in the context of a seminal semi-stream join, MESHJOIN (Mesh Join), and provide a cost model for the resulting semi-stream join algorithm, which we call CMESHJOIN (Cached Mesh Join). The algorithm takes advantage of skewed distributions; this article presents results for Zipfian distributions of the type that appears in many applications.
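The effect of such a front-stage under skew can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the CMESHJOIN algorithm: the abstract does not say how the cache is filled or maintained, so the code assumes a static cache holding the most frequent keys of R. The names `zipf_stream` and `cached_front_stage`, the relation size, the cache size, and the Zipf exponent of 1.0 are all hypothetical choices made for illustration.

```python
import random

def zipf_stream(num_keys, length, exponent=1.0, seed=42):
    """Draw join keys 0..num_keys-1 with Zipf-like skew (key 0 is most frequent)."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_keys + 1)]
    return rng.choices(range(num_keys), weights=weights, k=length)

def cached_front_stage(stream, relation, cache_keys):
    """Split a stream into tuples joined from the cache and tuples passed on.

    relation   : dict standing in for the disk-based relation R (assumption)
    cache_keys : keys of R assumed to be held in main memory (assumption)
    """
    cache = {k: relation[k] for k in cache_keys}  # in-memory part of R
    joined, forwarded = [], []
    for key in stream:
        if key in cache:
            joined.append((key, cache[key]))      # joined without touching disk
        else:
            forwarded.append(key)                 # left to the disk-based join
    return joined, forwarded

# Hypothetical sizes: R has 10,000 rows and the cache holds the top 1% of keys.
relation = {k: f"row-{k}" for k in range(10_000)}
stream = zipf_stream(num_keys=10_000, length=50_000)
joined, forwarded = cached_front_stage(stream, relation, cache_keys=range(100))
hit_rate = len(joined) / len(stream)
print(f"cache holds 1% of R and serves {hit_rate:.0%} of the stream")
```

Only the tuples in `forwarded` would reach the underlying semi-stream join (MESHJOIN in the article's analysis), which is where a gain for skewed inputs would come from in this simplified picture.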
