Level-Wise Distribution of Wavelet Coefficients for Processing kNN Queries over Distributed Streams

We present LEEWAVE, a bandwidth-efficient approach to searching for range-specified k-nearest neighbors among distributed streams by LEvEl-wise distribution of WAVElet coefficients. To find the k streams most similar to a range-specified reference stream, the relevant wavelet coefficients of the reference stream can be sent to the peer sites, which then compute the similarities. However, bandwidth is wasted unnecessarily if all the relevant coefficients are sent at once. Instead, we present a level-wise approach that leverages the multi-resolution property of the wavelet coefficients. Starting from the top level and moving down one level at a time, the query initiator sends only the coefficients of a single level to a progressively shrinking set of candidates. This raises one difficult challenge: how does the query initiator prune the candidates without knowing all the relevant coefficients? To overcome this challenge, we derive and maintain a similarity range for each candidate and gradually tighten the bounds of this range as we move from one level to the next. The increasingly tight similarity ranges enable the query initiator to prune candidates effectively without causing any false dismissal. Extensive experiments with real and synthetic data show that, compared with prior approaches, LEEWAVE uses significantly less bandwidth under a wide range of conditions.
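The level-wise idea can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the paper's algorithm: it uses an orthonormal Haar transform over a whole window whose length is a power of two (no range specification), squared Euclidean distance as the dissimilarity, and a simple bound in which each peer also reports the energy of its not-yet-compared coefficients. The names `haar_levels`, `energy`, and `levelwise_knn` are hypothetical. The bounds used here follow from the triangle inequality on the remaining coefficients and are not necessarily the similarity range derived in the paper.

```python
import math

def haar_levels(x):
    """Orthonormal Haar transform of a power-of-two-length sequence.
    Returns [approximation, coarsest details, ..., finest details]."""
    a = list(x)
    details = []
    while len(a) > 1:
        s = [(a[i] + a[i + 1]) / math.sqrt(2) for i in range(0, len(a), 2)]
        d = [(a[i] - a[i + 1]) / math.sqrt(2) for i in range(0, len(a), 2)]
        details.append(d)
        a = s
    return [a] + details[::-1]

def energy(levels):
    """Sum of squared coefficients over a list of levels."""
    return sum(c * c for lvl in levels for c in lvl)

def levelwise_knn(query, streams, k):
    """Simulate level-wise pruning; returns (k nearest names, coefficients sent)."""
    q = haar_levels(query)
    peers = {name: haar_levels(s) for name, s in streams.items()}
    partial = {name: 0.0 for name in peers}  # exact distance over levels seen so far
    candidates = set(peers)
    sent = 0
    for lvl in range(len(q)):
        q_rest = math.sqrt(energy(q[lvl + 1:]))
        bounds = {}
        for name in candidates:
            c = peers[name]
            sent += len(q[lvl])  # one level of coefficients goes to this candidate
            partial[name] += sum((a - b) ** 2 for a, b in zip(q[lvl], c[lvl]))
            # Assumption: the peer reports the energy of its remaining levels.
            c_rest = math.sqrt(energy(c[lvl + 1:]))
            lo = partial[name] + (q_rest - c_rest) ** 2
            hi = partial[name] + (q_rest + c_rest) ** 2
            bounds[name] = (lo, hi)
        if len(candidates) > k:
            # A candidate whose lower bound exceeds the k-th smallest upper
            # bound cannot be among the k nearest, so dropping it is safe.
            kth_hi = sorted(hi for _, hi in bounds.values())[k - 1]
            candidates = {n for n in candidates if bounds[n][0] <= kth_hi}
    return sorted(candidates, key=lambda n: partial[n])[:k], sent
```

Because the transform is orthonormal, the distance accumulated over all levels equals the time-domain squared Euclidean distance, so after the last level each candidate's range collapses to its exact distance. The bandwidth saving comes from the `sent` counter growing only for candidates that survive each level.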
