Querying uncertain data with aggregate constraints

Data uncertainty arises in many situations. A common approach to query processing over uncertain data is to sample many "possible worlds" from the data and run queries against them. However, sampling is not trivial, because a randomly sampled possible world may violate known constraints on the data. In this paper, we focus on an important category of constraints, aggregate constraints. An aggregate constraint is placed on a set of records rather than on a single record, and a real-life system usually has a large number of them. Finding qualified possible worlds in this setting is challenging, because tuple-by-tuple sampling rarely yields a qualified possible world and is therefore extremely inefficient. We introduce two approaches for querying uncertain data with aggregate constraints: constraint-aware sampling and MCMC sampling. Our experiments show that the new approaches produce high-quality query results at reasonable cost.
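
To make the difficulty concrete, the following is a minimal sketch of the naive baseline described above: tuple-by-tuple sampling of possible worlds, keeping only those that satisfy one aggregate constraint. It is not the constraint-aware or MCMC method of the paper. The toy relation `UNCERTAIN_RECORDS`, the SUM-range constraint, and all function names are assumptions made for illustration only.

```python
import random

# Hypothetical uncertain relation: each record lists alternative values
# with their probabilities. The numbers are illustrative, not from the paper.
UNCERTAIN_RECORDS = [
    [(10, 0.5), (20, 0.5)],
    [(5, 0.3), (15, 0.7)],
    [(0, 0.6), (30, 0.4)],
]


def sample_world(records, rng):
    """Draw one possible world by sampling every record independently."""
    world = []
    for alternatives in records:
        values = [value for value, _ in alternatives]
        weights = [prob for _, prob in alternatives]
        world.append(rng.choices(values, weights=weights, k=1)[0])
    return world


def satisfies(world, low, high):
    """Assumed aggregate constraint on the whole set: low <= SUM <= high."""
    return low <= sum(world) <= high


def rejection_sample(records, low, high, n_worlds, rng, max_tries=100_000):
    """Keep sampling worlds until n_worlds qualified ones are found."""
    accepted, tries = [], 0
    while len(accepted) < n_worlds and tries < max_tries:
        tries += 1
        world = sample_world(records, rng)
        if satisfies(world, low, high):
            accepted.append(world)
    return accepted, tries


def estimate_query(worlds):
    """Example query: mean value of the first record over qualified worlds."""
    if not worlds:
        raise ValueError("no qualified possible world was sampled")
    return sum(world[0] for world in worlds) / len(worlds)


if __name__ == "__main__":
    rng = random.Random(0)
    worlds, tries = rejection_sample(UNCERTAIN_RECORDS, 40, 50, 100, rng)
    print(f"accepted {len(worlds)} worlds in {tries} tries")
    print("query estimate:", estimate_query(worlds))
```

In this toy relation only one of the eight value combinations has a sum between 40 and 50, and it occurs with probability 0.5 × 0.3 × 0.4 = 0.06, so roughly 94% of the sampled worlds are discarded. With many records and many aggregate constraints the acceptance rate drops much further, which is the inefficiency that motivates sampling schemes that take the constraints into account.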
