Paper information - A memory-optimal many-to-many semi-stream join

Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been based solely on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate, both empirically and analytically, with that of related approaches.
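To make the class of algorithms described above more concrete, the following is a minimal Python sketch of a hash-based semi-stream equijoin with a cache in front of the main hash table. It is an illustration only and not SSBJ itself: the class name `CachedSemiStreamJoin`, the `step` method, the in-memory list standing in for the disk relation, and the round-robin partition scan are all assumptions made for this example. In particular, the set of cached keys is simply passed in by the caller, because the cache inequality is not stated in the abstract.

```python
from collections import defaultdict
from math import ceil


class CachedSemiStreamJoin:
    """Toy hash-based semi-stream equijoin with a cache (illustrative, not SSBJ)."""

    def __init__(self, master_rows, cached_keys, partition_size=2):
        # Stand-in for the disk-based master data: a list of (key, payload)
        # rows read in fixed-size partitions. Several rows may share a key.
        self.disk = list(master_rows)
        self.partition_size = partition_size
        self.num_partitions = max(1, ceil(len(self.disk) / partition_size))
        self.next_partition = 0
        # Cache: every master row for the chosen keys is kept in memory.
        # Which keys to cache is the caller's choice here (an assumption).
        self.cache = defaultdict(list)
        for key, payload in self.disk:
            if key in cached_keys:
                self.cache[key].append(payload)
        # Main hash table: the stream is the build input. Each entry is
        # [stream_value, partitions_still_to_see].
        self.pending = defaultdict(list)

    def step(self, stream_batch):
        """Consume a batch of (key, value) stream tuples and return join results."""
        out = []
        for key, value in stream_batch:
            if key in self.cache:
                # Cache hit: join immediately with all cached master rows.
                out.extend((key, value, m) for m in self.cache[key])
            else:
                # Cache miss: wait in the hash table for one full disk cycle.
                self.pending[key].append([value, self.num_partitions])

        # Load the next disk partition and probe the hash table with it.
        start = self.next_partition * self.partition_size
        partition = self.disk[start:start + self.partition_size]
        self.next_partition = (self.next_partition + 1) % self.num_partitions
        for key, payload in partition:
            for entry in self.pending.get(key, []):
                out.append((key, entry[0], payload))

        # Expire stream tuples that have now seen every partition.
        for key in list(self.pending):
            for entry in self.pending[key]:
                entry[1] -= 1
            remaining = [e for e in self.pending[key] if e[1] > 0]
            if remaining:
                self.pending[key] = remaining
            else:
                del self.pending[key]
        return out


master = [("a", "A1"), ("a", "A2"), ("b", "B1"), ("c", "C1")]
join = CachedSemiStreamJoin(master, cached_keys={"a"}, partition_size=2)
first = join.step([("a", 1), ("b", 2)])   # "a" is served from the cache
second = join.step([])                    # "b" matches when its partition loads
```

In this sketch a cached key never occupies space in the main hash table, while an uncached stream tuple stays there until the whole master relation has cycled past it. That trade-off between cache memory and hash-table memory is what makes the choice of cached keys matter for performance.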
