PROUD: a probabilistic approach to processing similarity queries over uncertain data streams

We present PROUD -- A PRObabilistic approach to processing similarity queries over Uncertain Data streams, where the data streams here are mainly time series streams. In contrast to data with certainty, an uncertain series is an ordered sequence of random variables. The distance between two uncertain series is also a random variable. We use a general uncertain data model, where only the mean and the deviation of each random variable at each timestamp are available. We derive mathematical conditions for progressively pruning candidates to reduce the computation cost. We then apply PROUD to a streaming environment where only sketches of streams, like wavelet synopses, are available. Extensive experiments are conducted to evaluate the effectiveness of PROUD and compare it with Det, a deterministic approach that directly processes data without considering uncertainty. The results show that, compared with Det, PROUD offers a flexible trade-off between false positives and false negatives by controlling a threshold, while maintaining a similar computation cost. In contrast, Det does not provide such flexibility. This trade-off is important as in some applications false negatives are more costly, while in others, it is more critical to keep the false positives low.

[1]  Jennifer Widom,et al.  Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[2]  Philip S. Yu,et al.  LeeWave: level-wise distribution of wavelet coefficients for processing kNN queries over distributed streams , 2008, Proc. VLDB Endow..

[3]  Graham Cormode,et al.  Sketching probabilistic data streams , 2007, SIGMOD '07.

[4]  Ambuj K. Singh,et al.  APLA: Indexing Arbitrary Probability Distributions , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[5]  Ming-Syan Chen,et al.  Efficient range-constrained similarity search on wavelet synopses over multiple streams , 2006, CIKM '06.

[6]  Feifei Li,et al.  Efficient Processing of Top-k Queries in Uncertain Databases with x-Relations , 2008, IEEE Transactions on Knowledge and Data Engineering.

[7]  Jennifer Widom,et al.  Working Models for Uncertain Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[8]  Yufei Tao,et al.  Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions , 2005, VLDB.

[9]  C. C. Heyde,et al.  Central Limit Theorem , 2006 .

[10]  Jeffrey Scott Vitter,et al.  Wavelet-based histograms for selectivity estimation , 1998, SIGMOD '98.

[11]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[12]  Prithviraj Sen,et al.  Representing and Querying Correlated Tuples in Probabilistic Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[13]  Sunil Prabhakar,et al.  Evaluating probabilistic queries over imprecise data , 2003, SIGMOD '03.

[14]  Susanne E. Hambrusch,et al.  Indexing Uncertain Categorical Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[15]  Dennis Shasha,et al.  StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time , 2002, VLDB.

[16]  Renée J. Miller,et al.  Similarity search over time-series data using wavelets , 2002, Proceedings 18th International Conference on Data Engineering.

[17]  Andrew McGregor,et al.  Estimating statistical aggregates on probabilistic data streams , 2007, PODS.

[18]  Mohamed A. Soliman,et al.  Top-k Query Processing in Uncertain Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[19]  G. Oehlert A note on the delta method , 1992 .

[20]  S. Muthukrishnan,et al.  One-Pass Wavelet Decompositions of Data Streams , 2003, IEEE Trans. Knowl. Data Eng..

[21]  Charu C. Aggarwal On Unifying Privacy and Uncertain Data Models , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[22]  Jeffrey Scott Vitter,et al.  Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data , 2004, VLDB.

[23]  Dimitris Sacharidis,et al.  Fast Approximate Wavelet Tracking on Streams , 2006, EDBT.

[24]  Philip S. Yu,et al.  On High Dimensional Indexing of Uncertain Data , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[25]  Christopher Ré,et al.  Efficient Top-k Query Evaluation on Probabilistic Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[26]  Sudipto Guha,et al.  Wavelet synopsis for data streams: minimizing non-euclidean error , 2005, KDD '05.

[27]  Bin Jiang,et al.  Probabilistic Skylines on Uncertain Data , 2007, VLDB.

[28]  Jian Pei,et al.  Efficiently Answering Probabilistic Threshold Top-k Queries on Uncertain Data , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[29]  Christian Böhm,et al.  The Gauss-Tree: Efficient Object Identification in Databases of Probabilistic Feature Vectors , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[30]  Dan Olteanu,et al.  10106 Worlds and Beyond: Efficient Representation and Processing of Incomplete Information , 2007, ICDE.

[31]  Nikos Mamoulis,et al.  One-Pass Wavelet Synopses for Maximum-Error Metrics , 2005, VLDB.

[32]  Dan Olteanu,et al.  Fast and Simple Relational Processing of Uncertain Data , 2007, 2008 IEEE 24th International Conference on Data Engineering.

[33]  Jennifer Widom,et al.  Databases with uncertainty and lineage , 2008, The VLDB Journal.

[34]  Xiang Lian,et al.  Pattern Matching over Cloaked Time Series , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[35]  Chi-Yin Chow,et al.  Probabilistic Verifiers: Evaluating Constrained Nearest-Neighbor Queries over Uncertain Data , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[36]  Jennifer Widom,et al.  ULDBs: databases with uncertainty and lineage , 2006, VLDB.

[37]  C. Koch,et al.  Worlds and Beyond : Effcient Representation and Processing of Incomplete Information , 2007 .