Paper information - Continuously monitoring top-k uncertain data streams: a probabilistic threshold method

Recently, uncertain data processing has become more and more important. Although a significant amount of previous research explores various continuous queries on data streams, continuous queries on uncertain data streams have seldom been investigated. In this paper, we formulate a novel and challenging problem of continuously monitoring top-k uncertain data streams, and propose a probabilistic threshold method. We develop four algorithms systematically: a deterministic exact algorithm, a randomized method, and their space-efficient versions using quantile summaries. An extensive empirical study using real data sets and synthetic data sets is reported to verify the effectiveness and the efficiency of our methods.

Jian Pei | Ming Hua

[1] Divesh Srivastava,et al. On computing correlated aggregates over continual data streams , 2001, SIGMOD '01.

[2] Christopher Ré,et al. Efficient Top-k Query Evaluation on Probabilistic Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[3] Anupam Gupta,et al. Counting inversions in lists , 2003, SODA '03.

[4] Surya Nepal,et al. Query processing issues in image (multimedia) databases , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[5] Xiang Lian,et al. Probabilistic ranked queries in uncertain databases , 2008, EDBT '08.

[6] Andrew McGregor,et al. Estimating statistical aggregates on probabilistic data streams , 2007, PODS.

[7] Dimitrios Gunopulos,et al. Answering top-k queries using views , 2006, VLDB.

[8] Divesh Srivastava,et al. Effective computation of biased quantiles over data streams , 2005, 21st International Conference on Data Engineering (ICDE'05).

[9] Dan Suciu,et al. The dichotomy of conjunctive queries on probabilistic structures , 2006, PODS.

[10] Dan Olteanu,et al. $10^{10^{6}}$ worlds and beyond: efficient representation and processing of incomplete information , 2006, 2007 IEEE 23rd International Conference on Data Engineering.

[11] S. Muthukrishnan,et al. How to Summarize the Universe: Dynamic Maintenance of Quantiles , 2002, VLDB.

[12] Claudio Gutierrez,et al. Survey of graph database models , 2008, CSUR.

[13] Chun Zhang,et al. Storing and querying ordered XML using a relational database system , 2002, SIGMOD '02.

[14] Gurmeet Singh Manku,et al. Approximate counts and quantiles over sliding windows , 2004, PODS.

[15] David J. DeWitt,et al. NiagaraCQ: a scalable continuous query system for Internet databases , 2000, SIGMOD '00.

[16] Christopher Olston,et al. Distributed top-k monitoring , 2003, SIGMOD '03.

[17] Dimitrios Gunopulos,et al. Ad-hoc Top-k Query Answering for Data Streams , 2007, VLDB.

[18] Jeffrey Scott Vitter,et al. Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data , 2004, VLDB.

[19] Jennifer Widom,et al. Continuous queries over data streams , 2001, SIGMOD Rec.

[20] Sanjeev Khanna,et al. Space-efficient online computation of quantile summaries , 2001, SIGMOD '01.

[21] Song Han,et al. A Statistics-Based Sensor Selection Scheme for Continuous Probabilistic Queries in Sensor Networks , 2005, RTCSA.

[22] Jennifer Widom,et al. Models and issues in data stream systems , 2002, PODS.

[23] Sumit Sarkar,et al. A probabilistic relational model and algebra , 1996, TODS.

[24] Bruce G. Lindsay,et al. Approximate medians and other quantiles in one pass and with limited memory , 1998, SIGMOD '98.

[25] Graham Kalton,et al. Introduction to Survey Sampling , 1983 .

[26] Serge Abiteboul,et al. On the Representation and Querying of Sets of Possible Worlds , 1991, Theor. Comput. Sci..

[27] Susanne E. Hambrusch,et al. Indexing Uncertain Categorical Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[28] Wei Hong,et al. Model-Driven Data Acquisition in Sensor Networks , 2004, VLDB.

[29] Gerhard Weikum,et al. KLEE: A Framework for Distributed Top-k Query Algorithms , 2005, VLDB.

[30] Hongjun Lu,et al. Continuously maintaining quantile summaries of the most recent N elements over a data stream , 2004, Proceedings. 20th International Conference on Data Engineering.

[31] David J. DeWitt,et al. NiagaraCQ: a scalable continuous query system for Internet databases , 2000, SIGMOD 2000.

[32] Dan Olteanu,et al. $10^{10^{6}}$ Worlds and Beyond: Efficient Representation and Processing of Incomplete Information , 2007, ICDE.

[33] Moni Naor,et al. Optimal aggregation algorithms for middleware , 2001, PODS.

[34] Divesh Srivastava,et al. Space- and time-efficient deterministic algorithms for biased quantiles over data streams , 2006, PODS.

[35] T. S. Jayram,et al. Efficient aggregation algorithms for probabilistic data , 2007, SODA '07.

[36] A. Doucet,et al. Particle filtering for multi-target tracking and sensor management , 2002, Proceedings of the Fifth International Conference on Information Fusion. FUSION 2002. (IEEE Cat.No.02EX5997).

[37] Kenneth Lange,et al. Numerical analysis for statisticians , 1999 .

[38] Jian Pei,et al. Ranking queries on uncertain data: a probabilistic threshold approach , 2008, SIGMOD Conference.

[39] Jian Xu,et al. Space-efficient Relative Error Order Sketch over Data Streams , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[40] J. Ian Munro,et al. Selection and sorting with limited storage , 1978, 19th Annual Symposium on Foundations of Computer Science (sfcs 1978).

[41] Leslie G. Valiant,et al. Fast probabilistic algorithms for hamiltonian circuits and matchings , 1977, STOC '77.

[42] Susanne E. Hambrusch,et al. Database Support for Probabilistic Attributes and Tuples , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[43] Graham Cormode,et al. Sketching probabilistic data streams , 2007, SIGMOD '07.

[44] Jennifer Widom,et al. Working Models for Uncertain Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[45] W. Hoeffding. On the Distribution of the Number of Successes in Independent Trials , 1956 .

[46] Ihab F. Ilyas,et al. A survey of top-k query processing techniques in relational database systems , 2008, CSUR.

[47] Amol Deshpande,et al. Efficient Query Evaluation over Temporally Correlated Probabilistic Streams , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[48] Suk Kyoon Lee,et al. Imprecise and uncertain information in databases: an evidential approach , 1992, [1992] Eighth International Conference on Data Engineering.

[49] Mohamed A. Soliman,et al. Top-k Query Processing in Uncertain Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[50] Yufei Tao,et al. Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions , 2005, VLDB.

[51] Dan Suciu,et al. Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[52] Sunil Prabhakar,et al. Evaluating probabilistic queries over imprecise data , 2003, SIGMOD '03.

[53] Dan Suciu,et al. Management of probabilistic data: foundations and challenges , 2007, PODS '07.

[54] Xi Zhang,et al. On the semantics and evaluation of top-k queries in probabilistic databases , 2008, ICDE Workshops.

[55] Bruce G. Lindsay,et al. Random sampling techniques for space efficient online computation of order statistics of large datasets , 1999, SIGMOD '99.

[56] Tomasz Imielinski,et al. Incomplete Information in Relational Databases , 1984, JACM.

[57] Kyriakos Mouratidis,et al. Continuous monitoring of top-k queries over sliding windows , 2006, SIGMOD Conference.

[58] Jian Pei,et al. Efficiently Answering Probabilistic Threshold Top-k Queries on Uncertain Data , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[59] Zhenyu Liu,et al. Cost-efficient processing of MIN/MAX queries over distributed sensors with uncertainty , 2005, SAC '05.

[60] Peter Buneman,et al. Semistructured data , 1997, PODS.

[61] Susanne E. Hambrusch,et al. Orion 2.0: native support for uncertain data , 2008, SIGMOD Conference.