A generic front-stage for semi-stream processing

Recently, a number of semi-stream join algorithms have been published. The typical system setup for these consists of one fast stream input that has to be joined with a disk-based relation R. These semi-stream join approaches typically perform the join within a limited main-memory partition, which is generally not large enough to hold the whole relation R. We propose a caching approach that can be used as a front-stage for different semi-stream join algorithms, resulting in significant performance gains for common applications. We analyze our approach in the context of a seminal semi-stream join, MESHJOIN (Mesh Join), and provide a cost model for the resulting semi-stream join algorithm, which we call CMESHJOIN (Cached Mesh Join). The algorithm takes advantage of skewed distributions; this article presents results for Zipfian distributions of the kind that appear in many applications.
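
The front-stage idea can be illustrated with a minimal Python sketch. It is not the CMESHJOIN implementation; it only assumes that a small in-memory cache holds the most frequently joined tuples of R, that stream tuples matching the cache are joined immediately, and that all other tuples are handed in batches to an underlying semi-stream join. The names `CacheFrontStage`, `backend`, `batch_size`, and `zipf_hit_rate` are hypothetical, and the backend is a stand-in for an algorithm such as MESHJOIN.

```python
from collections import deque
from typing import Callable, Deque, Dict, Hashable, List, Tuple

Row = Tuple  # a stream tuple or a tuple of R; the join key is one of its fields


class CacheFrontStage:
    """Hypothetical cache front-stage placed before a semi-stream join."""

    def __init__(
        self,
        cache: Dict[Hashable, Row],
        backend: Callable[[List[Row]], List[Row]],
        key_index: int = 0,
        batch_size: int = 2,
    ) -> None:
        self.cache = cache            # frequent part of R, keyed by join key
        self.backend = backend        # assumed stand-in for the disk-based join
        self.key_index = key_index    # position of the join key in a stream tuple
        self.batch_size = batch_size  # misses collected before calling the backend
        self.pending: Deque[Row] = deque()

    def process(self, stream_tuple: Row) -> List[Row]:
        """Join one stream tuple; return any join results produced right now."""
        key = stream_tuple[self.key_index]
        if key in self.cache:
            # Cache hit: the join result is produced without touching disk.
            return [stream_tuple + self.cache[key]]
        # Cache miss: defer to the underlying join, in batches.
        self.pending.append(stream_tuple)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return []

    def flush(self) -> List[Row]:
        """Send all deferred stream tuples to the underlying join."""
        batch = list(self.pending)
        self.pending.clear()
        return self.backend(batch) if batch else []


def zipf_hit_rate(cached_keys: int, total_keys: int, exponent: float = 1.0) -> float:
    """Share of stream tuples that hit the cache if key frequencies follow a
    Zipfian law (frequency of rank r proportional to 1 / r**exponent) and the
    cache holds the `cached_keys` most frequent keys. Illustrative only; this
    is not the cost model of the paper."""
    weights = [1.0 / rank ** exponent for rank in range(1, total_keys + 1)]
    return sum(weights[:cached_keys]) / sum(weights)
```

Under a skewed key distribution, a cache holding only a small share of R can absorb a large share of the stream, which is the effect the front-stage is meant to exploit; `zipf_hit_rate` gives a rough feel for that share under the stated assumptions.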
