Continuously monitoring top-k uncertain data streams: a probabilistic threshold method

Uncertain data processing has become increasingly important. Although a significant body of previous research explores various continuous queries on data streams, continuous queries on uncertain data streams have seldom been investigated. In this paper, we formulate the novel and challenging problem of continuously monitoring top-k uncertain data streams and propose a probabilistic threshold method. We systematically develop four algorithms: a deterministic exact algorithm, a randomized method, and space-efficient versions of both that use quantile summaries. An extensive empirical study on real and synthetic data sets verifies the effectiveness and efficiency of our methods.
