Evaluating top-k queries over incomplete data streams

We study the continuous monitoring of top-k queries over multiple non-synchronized streams. Under the sliding window model, top-k monitoring has been well studied in recent years. Most approaches, however, assume synchronized streams in which all attributes of an object become known to the query processing engine simultaneously. In many streaming scenarios, though, different attributes of an item are reported in separate, non-synchronized streams, which prevents exact score calculation. We show how the traditional notion of object dominance changes in this setting so that the k-dominance set still includes all and only those objects that have a chance of being among the top-k results during their lifetime. Based on this, we propose an exact algorithm that generates multiple instances of the same object in a way that enables efficient object pruning. We show that, even with object pruning, the storage required for exact evaluation of top-k queries is linear in the size of the sliding window. Because data must reside in main memory to provide fast online answers and cope with high stream rates, storing all of it may not be possible with limited resources. We therefore present an approximate algorithm that leverages correlation statistics of pairs of streams to evict more objects while maintaining accuracy. We evaluate the efficiency of the proposed algorithms with extensive experiments.
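To make the adjusted dominance notion concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It rests on several illustrative assumptions: attribute values are normalized to [0, 1], the score is the plain sum of attributes, an attribute that has not yet been reported is `None`, and objects that arrive later also expire later. The names `Obj`, `bounds`, `dominates`, and `k_dominance_set` are hypothetical. Under these assumptions an incomplete object has only a score interval, and an object can be pruned once at least k later-expiring objects have a worst-case score no lower than its best-case score.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Obj:
    oid: str
    arrival: int                      # assumed: later arrival means later expiry
    attrs: Sequence[Optional[float]]  # None = attribute not yet reported; known values in [0, 1]

    def bounds(self) -> Tuple[float, float]:
        """Score interval under the assumed sum scoring function."""
        lo = sum(a for a in self.attrs if a is not None)      # missing attributes count as 0
        hi = lo + sum(1.0 for a in self.attrs if a is None)   # missing attributes count as 1
        return lo, hi


def dominates(a: Obj, b: Obj) -> bool:
    """a dominates b if a outlives b and a's worst case is at least b's best case."""
    return a.arrival > b.arrival and a.bounds()[0] >= b.bounds()[1]


def k_dominance_set(window: List[Obj], k: int) -> List[Obj]:
    """Keep objects dominated by fewer than k others (quadratic scan, for clarity only)."""
    keep = []
    for o in window:
        dominators = sum(1 for p in window if dominates(p, o))
        if dominators < k:
            keep.append(o)
    return keep
```

For example, with the window `[Obj("a", 1, [0.2, 0.1]), Obj("b", 2, [0.9, None]), Obj("c", 3, [0.8, 0.7]), Obj("d", 4, [0.5, 0.4])]` and k = 1, object `a` is pruned because the later object `b` is guaranteed a score of at least 0.9. Object `b` is kept because its missing attribute leaves its best-case score at 1.9, which no later object is certain to reach. This is why incomplete objects are harder to evict than complete ones.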
