A memory-optimal many-to-many semi-stream join

Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as the build input for a main hash table and also include a cache for frequently accessed master data. The composition of the cache strongly affects performance, yet the choice of which master data to cache has so far been based solely on heuristics. We present the first formal criterion, a cache inequality, that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate, both empirically and analytically, with that of related approaches.
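
To make the hash-join-derived class of algorithms concrete, the following is a minimal Python sketch of a generic cache-assisted semi-stream many-to-many equijoin. It is not SSBJ and does not implement the cache inequality, which is not given here. The function name `semi_stream_join`, the tuple layout, and the representation of master data as in-memory "partitions" are all assumptions made for illustration, and the stream is treated as a finite batch rather than a continuous input.

```python
from collections import defaultdict


def semi_stream_join(stream, master_partitions, cache):
    """Join stream tuples against master data (illustrative sketch only).

    stream:            iterable of (key, payload) tuples.
    master_partitions: iterable of partitions, each a list of (key, row);
                       stands in for the disk-based master relation.
    cache:             dict mapping a join key to ALL master rows for that
                       key; how keys are chosen for it is left open here.
    Yields (key, payload, master_row) for every matching pair.
    """
    # Main hash table, built from stream tuples that miss the cache.
    pending = defaultdict(list)

    for key, payload in stream:
        if key in cache:
            # Cache hit: join immediately, without any disk access.
            for row in cache[key]:
                yield key, payload, row
        else:
            # Cache miss: keep the tuple until its master rows are read.
            pending[key].append(payload)

    # Scan the master relation and probe the stream-built hash table.
    for partition in master_partitions:
        for key, row in partition:
            for payload in pending.get(key, ()):
                yield key, payload, row


# Hypothetical example data: key "a" is frequent and therefore cached.
stream = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
cache = {"a": ["A1", "A2"]}
master_partitions = [
    [("a", "A1"), ("a", "A2"), ("b", "B1")],
    [("c", "C1"), ("c", "C2")],
]
results = list(semi_stream_join(stream, master_partitions, cache))
```

In this sketch, tuples with a cached key never enter the main hash table, so the memory split between cache and hash table directly determines how much work each disk scan has to do. Deciding which keys belong in `cache` is exactly the question the cache inequality addresses.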
