Discovering Meaningful Certain Keys from Incomplete and Inconsistent Relations

Completeness and consistency are two important dimensions of data quality, in particular for relational data, because most data sets found in practice are both incomplete and inconsistent. Keys are the simplest yet arguably most important class of integrity constraints. Recently, certain keys were introduced for incomplete relations. Certain keys can efficiently manage the integrity of entities while still permitting incompleteness in the columns of the key. It is therefore an important task to discover the set of certain keys that hold in a given incomplete relation. However, if the given incomplete relation is also inconsistent with respect to some meaningful certain keys, algorithms that only discover keys satisfied by the relation cannot find those keys. As meaningful keys are likely to have a small number of violations, we propose an algorithm that discovers certain keys that do not exceed a given number of violations. We illustrate the effectiveness and efficiency of our algorithm in discovering meaningful certain keys from publicly available data sets.
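
To make the idea of a certain key with a bounded number of violations concrete, the following is a minimal brute-force sketch, not the algorithm proposed in this work. It assumes the usual reading of a certain key: a column set X is a certain key if no two distinct rows are weakly similar on X, that is, if no two rows agree or have a null in every column of X. Counting a violation as one weakly similar pair of rows is an assumption made for illustration, since the abstract does not fix how violations are counted. The names `weakly_similar`, `count_violations`, `discover_certain_keys`, and `max_violations` are hypothetical.

```python
from itertools import combinations

NULL = None  # marker for a missing value


def weakly_similar(t1, t2, attrs):
    """True if, on every column in attrs, the rows agree or one value is null."""
    return all(
        t1[a] is NULL or t2[a] is NULL or t1[a] == t2[a]
        for a in attrs
    )


def count_violations(rows, attrs):
    """Number of row pairs that violate the certain key over attrs.

    Assumption: one violation = one pair of weakly similar rows.
    """
    return sum(
        1 for t1, t2 in combinations(rows, 2)
        if weakly_similar(t1, t2, attrs)
    )


def discover_certain_keys(rows, columns, max_violations=0):
    """Minimal column sets whose violation count is at most max_violations.

    Candidates are enumerated by increasing size, so every reported set
    is minimal; supersets of a reported set are skipped because adding
    columns can never increase the number of violating pairs.
    """
    found = []
    for size in range(1, len(columns) + 1):
        for attrs in combinations(columns, size):
            if any(set(key) <= set(attrs) for key in found):
                continue
            if count_violations(rows, attrs) <= max_violations:
                found.append(attrs)
    return found


# Hypothetical sample relation with nulls and one dirty pair of rows.
rows = [
    {"name": "Ann", "dept": "IT", "room": None},
    {"name": "Ann", "dept": "HR", "room": 2},
    {"name": "Bob", "dept": "IT", "room": 1},
    {"name": "Bob", "dept": None, "room": 1},
]
columns = ["name", "dept", "room"]

strict_keys = discover_certain_keys(rows, columns, max_violations=0)
tolerant_keys = discover_certain_keys(rows, columns, max_violations=1)
```

On this sample, no column set is a certain key when no violations are tolerated, while allowing a single violating pair recovers the column set (name, dept). The enumeration is exponential in the number of columns and is only meant to illustrate the definitions on small inputs.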
