Comparable dependencies over heterogeneous data

To study data dependencies over heterogeneous data in dataspaces, we define a general form of dependency, namely comparable dependencies (CDs), which specify constraints on comparable attributes. CDs cover the semantics of a broad class of database dependencies, including functional dependencies (FDs), metric functional dependencies (MFDs), and matching dependencies (MDs). As we illustrate, comparable dependencies are useful in practical dataspace applications such as semantic query optimization. Because data in dataspaces are heterogeneous, the first question, known as the validation problem, is to determine whether a dependency (almost) holds in a data instance. Unfortunately, as we prove, the validation problem with a given error or confidence guarantee is hard in general. In fact, the confidence validation problem is NP-hard even to approximate to within any constant factor. Nevertheless, we develop several approaches for efficient approximate computation, including greedy and randomized approaches whose approximation bound depends on the maximum number of violations that an object may introduce. Finally, through an extensive experimental evaluation on real data, we verify the superiority of our methods.
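To make the validation question concrete, the following minimal sketch checks a simplified dependency of the form "objects that are similar on one attribute should be similar on another" and greedily removes the objects involved in the most violations. It is an illustration under stated assumptions, not the algorithm of the paper: the function names, the numeric absolute-difference similarity, the tolerance parameters, and the sample data are all hypothetical.

```python
from itertools import combinations


def violating_pairs(rows, lhs, rhs, lhs_tol, rhs_tol):
    """Return index pairs that are similar on `lhs` but not on `rhs`.

    Similarity is assumed to be absolute numeric difference within a
    tolerance; this is a stand-in for whatever comparison a real
    comparable dependency would specify.
    """
    pairs = []
    for i, j in combinations(range(len(rows)), 2):
        similar_lhs = abs(rows[i][lhs] - rows[j][lhs]) <= lhs_tol
        similar_rhs = abs(rows[i][rhs] - rows[j][rhs]) <= rhs_tol
        if similar_lhs and not similar_rhs:
            pairs.append((i, j))
    return pairs


def greedy_removal(pairs):
    """Greedily drop the object involved in the most remaining violations.

    Returns the indices removed, in order. This is a heuristic: it does
    not guarantee a minimum-size removal set.
    """
    remaining = set(pairs)
    removed = []
    while remaining:
        counts = {}
        for i, j in remaining:
            counts[i] = counts.get(i, 0) + 1
            counts[j] = counts.get(j, 0) + 1
        # Ties are broken by the smallest index to keep the result deterministic.
        worst = max(sorted(counts), key=counts.get)
        removed.append(worst)
        remaining = {p for p in remaining if worst not in p}
    return removed


# Hypothetical sample data: similar prices should imply similar tax.
rows = [
    {"price": 100, "tax": 10},
    {"price": 101, "tax": 10},
    {"price": 100, "tax": 25},
    {"price": 500, "tax": 50},
]
pairs = violating_pairs(rows, "price", "tax", lhs_tol=1, rhs_tol=1)
removed = greedy_removal(pairs)
```

On this sample, the third object conflicts with the first two, so removing it alone leaves an instance in which the simplified dependency holds. The fraction of objects removed gives one rough notion of how far the instance is from satisfying the dependency.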
