Maintaining Consistency of Probabilistic Databases: A Linear Programming Approach

The problem of maintaining consistency via functional dependencies (FDs) has been studied and analyzed extensively within traditional database settings. There have also been many probabilistic data models proposed in the past decades. However, the problem of maintaining consistency in probabilistic relations via FDs is still unclear. In this paper, we clarify the concept of FDs in probabilistic relations and present an efficient chase algorithm LPChase(r,F) for maintaining consistency of a probabilistic relation r with respect to an FD set F. LPChase(r,F) adopts a novel approach that uses Linear Programming (LP) method to modify the probability of data values in r. There are many benefits of our approach. First, LPChase(r,F) guarantees that the output result is always the minimal change to r. Second, assuming that the expected size of an active domain consisting data values with non-zero probability is fixed, we demonstrate the interesting result that the LP solving time in LPChase(r,F) decreases as the probabilistic data domains grow, and becomes negligible for large domain size. On the other hand, the I/O time and modeling time become stable even when the domain size increases.

[1]  Gyula O. H. Katona,et al.  Functional dependencies distorted by errors , 2008, Discret. Appl. Math..

[2]  Jef Wijsen,et al.  Database repairing using updates , 2005, TODS.

[3]  Shang-Hua Teng,et al.  Smoothed analysis of algorithms: why the simplex algorithm usually takes polynomial time , 2001, STOC '01.

[4]  Mark Levene,et al.  Maintaining Consistency of Imprecise Relations , 1996, Comput. J..

[5]  Feifei Li,et al.  Semantics of Ranking Queries for Probabilistic Data and Expected Ranks , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[6]  Dan Olteanu,et al.  MayBMS: a probabilistic database management system , 2009, SIGMOD Conference.

[7]  D. Spielman,et al.  Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time , 2004 .

[8]  Susanne E. Hambrusch,et al.  Orion 2.0: native support for uncertain data , 2008, SIGMOD Conference.

[9]  Sumit Sarkar,et al.  Generalized Normal Forms for Probabilistic Relational Data , 2002, IEEE Trans. Knowl. Data Eng..

[10]  Christos Faloutsos,et al.  FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets , 1995, SIGMOD '95.

[11]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[12]  Sergio Greco,et al.  Approximate Probabilistic Query Answering over Inconsistent Databases , 2008, ER.

[13]  Paul Brown,et al.  BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data , 2003, VLDB.

[14]  Wilfred Ng,et al.  Maintaining consistency of vague databases using data dependencies , 2009, Data Knowl. Eng..