Cleaning uncertain data with quality guarantees

Uncertain or imprecise data are pervasive in applications like location-based services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical confidence. Given that a limited amount of resources is available to "clean" the database (e.g., by probing some sensor data values to get their latest values), we address the problem of choosing the set of uncertain objects to be cleaned, in order to achieve the best improvement in the quality of query answers. For this purpose, we present the PWS-quality metric, which is a universal measure that quantifies the ambiguity of query answers under the possible world semantics. We study how PWS-quality can be efficiently evaluated for two major query classes: (1) queries that examine the satisfiability of tuples independent of other tuples (e.g., range queries); and (2) queries that require the knowledge of the relative ranking of the tuples (e.g., MAX queries). We then propose a polynomial-time solution to achieve an optimal improvement in PWS-quality. Other fast heuristics are presented as well. Experiments, performed on both real and synthetic datasets, show that the PWS-quality metric can be evaluated quickly, and that our cleaning algorithm provides an optimal solution with high efficiency. To our best knowledge, this is the first work that develops a quality metric for a probabilistic database, and investigates how such a metric can be used for data cleaning purposes.

[1]  Yufei Tao,et al.  Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions , 2005, VLDB.

[2]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[3]  Feifei Li,et al.  Efficient Processing of Top-k Queries in Uncertain Databases with x-Relations , 2008, IEEE Trans. Knowl. Data Eng..

[4]  Christian Böhm,et al.  The Gauss-Tree: Efficient Object Identification in Databases of Probabilistic Feature Vectors , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[5]  A. Prasad Sistla,et al.  Querying the Uncertain Position of Moving Objects , 1997, Temporal Databases, Dagstuhl.

[6]  Reynold Cheng,et al.  Quality-Aware Probing of Uncertain Data with Resource Constraints , 2008, SSDBM.

[7]  Dieter Pfoser,et al.  Capturing the Uncertainty of Moving-Object Representations , 1999, SSD.

[8]  Chi-Yin Chow,et al.  Probabilistic Verifiers: Evaluating Constrained Nearest-Neighbor Queries over Uncertain Data , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[9]  Bin Jiang,et al.  Probabilistic Skylines on Uncertain Data , 2007, VLDB.

[10]  A. Prasad Sistla,et al.  Updating and Querying Databases that Track Mobile Units , 1999, Distributed and Parallel Databases.

[11]  Hector Garcia-Molina,et al.  The Management of Probabilistic Data , 1992, IEEE Trans. Knowl. Data Eng..

[12]  Yuri Gurevich,et al.  The complexity of query reliability , 1998, PODS.

[13]  Clifford Stein,et al.  Introduction to Algorithms, 2nd edition. , 2001 .

[14]  Parag Agrawal,et al.  Trio: a system for data, uncertainty, and lineage , 2006, VLDB.

[15]  Xiang Lian,et al.  Monochromatic and bichromatic reverse skyline search over uncertain databases , 2008, SIGMOD Conference.

[16]  Mohamed A. Soliman,et al.  Top-k Query Processing in Uncertain Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[17]  Dan Suciu,et al.  Towards correcting input data errors probabilistically using integrity constraints , 2006, MobiDE '06.

[18]  Jennifer Widom,et al.  ULDBs: databases with uncertainty and lineage , 2006, VLDB.

[19]  Michel de Rougemont The reliability of queries (extended abstract) , 1995, PODS '95.

[20]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[21]  Jeffrey Scott Vitter,et al.  Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data , 2004, VLDB.

[22]  Renée J. Miller,et al.  Clean Answers over Dirty Databases: A Probabilistic Approach , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[23]  Sunil Prabhakar,et al.  Evaluating probabilistic queries over imprecise data , 2003, SIGMOD '03.

[24]  Hans-Peter Kriegel,et al.  Probabilistic Nearest-Neighbor Query on Uncertain Objects , 2007, DASFAA.

[25]  Christopher Ré,et al.  Efficient Top-k Query Evaluation on Probabilistic Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[26]  Kamesh Munagala,et al.  A Sampling-Based Approach to Optimizing Top-k Queries in Sensor Networks , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[27]  ChengReynold,et al.  Cleaning uncertain data with quality guarantees , 2008, VLDB 2008.

[28]  Zhenyu Liu,et al.  Cost-efficient processing of MIN/MAX queries over distributed sensors with uncertainty , 2005, SAC '05.

[29]  Susanne E. Hambrusch,et al.  Indexing Uncertain Categorical Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[30]  Wei Hong,et al.  Model-Driven Data Acquisition in Sensor Networks , 2004, VLDB.

[31]  Michel de Rougemont,et al.  The Reliability of Queries. , 1995, ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems.

[32]  Jennifer Widom,et al.  Adaptive filters for continuous queries over distributed data streams , 2003, SIGMOD '03.

[33]  Jeffrey Scott Vitter,et al.  Greed sort: optimal deterministic sorting on parallel disks , 1995, JACM.

[34]  Gennady Diubin,et al.  The average behaviour of greedy algorithms for the knapsack problem: General distributions , 2003, Math. Methods Oper. Res..