Defining and Mining Functional Dependencies in Probabilistic Databases

Functional dependencies (traditional, approximate, and conditional) are of critical importance in relational databases because they capture the relationships between attributes. They are useful in schema normalization, data rectification, and source selection. Most of them, however, were developed in the context of deterministic data. Although uncertain databases have started receiving attention, these dependencies have not been defined for them, nor are fast algorithms available to evaluate their confidences. This paper defines the logical extensions of various forms of functional dependencies for probabilistic databases and explores the connections between them. We propose a pruning-based exact algorithm to evaluate the confidence of functional dependencies, a Monte Carlo-based algorithm to evaluate the confidence of approximate functional dependencies, and algorithms for their conditional counterparts in probabilistic databases. Experiments on both synthetic and real data evaluate the performance of these algorithms in assessing the confidence of dependencies and in mining them from data. We believe that having these dependencies and algorithms available for probabilistic databases will drive adoption of probabilistic data storage in industry.
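To make the idea of a dependency's confidence concrete, the following is a minimal illustrative sketch and not the algorithm proposed in the paper. It assumes a tuple-independent probabilistic relation (each row exists independently with a given probability) and assumes that the confidence of an exact functional dependency is the probability that it holds in a randomly drawn possible world. Both assumptions, along with the names `fd_holds`, `sample_world`, and `estimate_fd_confidence`, are hypothetical and introduced only for illustration.

```python
import random


def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds on rows.

    rows is a list of dicts; lhs and rhs are lists of attribute names.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # Two rows that agree on lhs but differ on rhs violate the FD.
        if seen.setdefault(key, val) != val:
            return False
    return True


def sample_world(prob_rows, rng):
    """Draw one possible world: keep each row independently with its probability.

    Assumption: tuple-independent model, prob_rows is a list of (row, p) pairs.
    """
    return [row for row, p in prob_rows if rng.random() < p]


def estimate_fd_confidence(prob_rows, lhs, rhs, n_samples=10000, seed=0):
    """Monte Carlo estimate of P(lhs -> rhs holds in a random possible world)."""
    rng = random.Random(seed)
    hits = sum(
        fd_holds(sample_world(prob_rows, rng), lhs, rhs)
        for _ in range(n_samples)
    )
    return hits / n_samples


# Two conflicting rows, each present with probability 0.5: the FD zip -> city
# is violated only when both rows appear, so the true confidence is 0.75.
example = [
    ({"zip": "85281", "city": "Tempe"}, 0.5),
    ({"zip": "85281", "city": "Phoenix"}, 0.5),
]
estimate = estimate_fd_confidence(example, ["zip"], ["city"])
```

Sampling trades exactness for speed: the estimate converges to the true probability as `n_samples` grows, whereas an exact computation over all possible worlds is exponential in the number of uncertain rows without further pruning.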
