Handling Uncertain Data in Array Database Systems

Scientific and intelligence applications have special data handling needs. In these settings, data does not fit the standard model of short coded records that has dominated data management for three decades. Array database systems have a specialized architecture to address this problem. Since the data is typically an approximation of reality, it is important to handle imprecision and uncertainty in an efficient and provably accurate way. We propose a discrete approach for representing value distributions and adopt a standard metric from probability theory, the variation distance, to measure the quality of a result distribution. We then propose a novel algorithm with a provable upper bound on the variation distance between its result distribution and the "ideal" one. Complementary to that, we advocate the use of a "statistical mode" that is suitable for the results of many queries and applications and is also much more efficient to execute. We show that the statistical mode also enables interesting predicate evaluation strategies. Finally, we evaluate our algorithms through extensive experiments on real-world datasets.
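To make the quality metric concrete, the following is a minimal sketch of the variation distance between two discrete value distributions, using the standard definition (half the sum of absolute probability differences over the joint support). The function name `variation_distance`, the dictionary representation, and the example distributions are illustrative assumptions, not the paper's implementation or data.

```python
def variation_distance(p, q):
    """Variation distance between two discrete distributions.

    p and q are assumed to be dicts mapping a value to its probability
    (an illustrative representation, not the paper's storage format).
    """
    support = set(p) | set(q)
    # Half the L1 distance between the two probability mass functions.
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


# Hypothetical "ideal" result distribution and an approximation of it.
ideal = {1: 0.5, 2: 0.3, 3: 0.2}
approx = {1: 0.4, 2: 0.4, 3: 0.2}

d = variation_distance(ideal, approx)
print(f"variation distance: {d:.3f}")
```

The distance is 0 when the two distributions are identical and 1 when their supports are disjoint, so an upper bound on it directly limits how far any event's probability under the approximate result can deviate from its probability under the ideal one.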
