Hitting set enumeration with partial information for unique column combination discovery

Unique column combinations (UCCs) are a fundamental concept in relational databases. They identify entities in the data and support various data management activities. Still, UCCs are usually not explicitly defined and need to be discovered. State-of-the-art data profiling algorithms efficiently discover UCCs in moderately sized datasets, but they tend to fail on large and, in particular, wide datasets due to run time and memory limitations. In this paper, we introduce HPIValid, a novel UCC discovery algorithm that implements a faster and more resource-saving search strategy. HPIValid models metadata discovery as a hitting set enumeration problem in hypergraphs. In this way, it combines efficient discovery techniques from data profiling research with the most recent theoretical insights into enumeration algorithms. Our evaluation shows that HPIValid is not only orders of magnitude faster than related work but also has a much smaller memory footprint.
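
To make the hitting set view concrete, the following minimal sketch illustrates the correspondence on a toy table. It is a brute-force illustration, not the HPIValid algorithm: every pair of rows contributes one hyperedge, namely the set of columns on which the two rows differ, and a column set is a UCC exactly when it intersects every such edge. The function names (`difference_sets`, `minimal_hitting_sets`, `minimal_uccs`) and the sample rows are hypothetical and chosen only for this example.

```python
from itertools import combinations


def difference_sets(rows):
    """Return one hyperedge per row pair: the column indices where the rows differ."""
    edges = set()
    for r, s in combinations(rows, 2):
        edges.add(frozenset(i for i, (a, b) in enumerate(zip(r, s)) if a != b))
    return edges


def minimal_hitting_sets(edges, num_columns):
    """Enumerate minimal hitting sets by increasing size.

    Exponential in the number of columns; for illustration only.
    """
    found = []
    for k in range(num_columns + 1):
        for cand in combinations(range(num_columns), k):
            c = set(cand)
            # Skip supersets of hitting sets that were already found.
            if any(set(m) <= c for m in found):
                continue
            # A hitting set shares at least one column with every edge.
            if all(c & e for e in edges):
                found.append(cand)
    return found


def minimal_uccs(rows):
    """Minimal UCCs are the minimal hitting sets of the difference sets."""
    num_columns = len(rows[0]) if rows else 0
    return minimal_hitting_sets(difference_sets(rows), num_columns)


# Toy table with columns 0 = name, 1 = city, 2 = id (made-up data).
rows = [
    ("Ada", "Berlin", 1),
    ("Ada", "Potsdam", 2),
    ("Bob", "Berlin", 2),
]
print(minimal_uccs(rows))
```

In this toy table no single column is unique, but every pair of columns is. If two rows are identical, their difference set is empty and cannot be hit, so the sketch reports no UCC at all. Materializing all pairwise difference sets is quadratic in the number of rows, which is one reason a naive approach like this does not scale to large datasets.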
