Holistic Data Profiling: Simultaneous Discovery of Various Metadata

Data profiling is the discipline of examining an unknown dataset for its structure and statistical information. It is a preprocessing step in a wide range of applications, such as data integration, data cleansing, and query optimization. For this reason, many algorithms have been proposed for the discovery of different kinds of metadata. When analyzing a dataset, these profiling algorithms are often applied in sequence, but they do not support one another, for instance, by sharing I/O cost or pruning information. We present the holistic algorithm Muds, which jointly discovers the three most important types of metadata: inclusion dependencies, unique column combinations, and functional dependencies. By sharing I/O cost and data structures across the different discovery tasks, Muds can clearly increase the efficiency of traditional sequential data profiling. The algorithm also introduces novel inter-task pruning rules that build upon different types of metadata, e.g., using unique column combinations to infer functional dependencies. We evaluate Muds in detail and compare it against the sequential execution of state-of-the-art algorithms. This evaluation shows that our holistic algorithm outperforms the baseline by up to a factor of 48 on datasets with favorable pruning conditions.
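To illustrate the kind of inter-task pruning mentioned above, the following minimal sketch relies on a standard fact: if a column combination X is unique, then X functionally determines every other column, so those functional dependencies need not be checked against the data. This is not the Muds implementation; the helper names (`is_unique`, `holds_fd`, `fds_implied_by_ucc`) and the toy table are assumptions made only for illustration.

```python
def is_unique(rows, cols):
    """Return True if no two rows agree on all columns in `cols`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_fd(rows, lhs, rhs):
    """Check the functional dependency lhs -> rhs by a direct scan."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # setdefault stores the first rhs value seen for this lhs value
        if mapping.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def fds_implied_by_ucc(ucc, all_columns):
    """A unique column combination determines every column outside it,
    so these FDs can be emitted without touching the data."""
    return [(ucc, col) for col in all_columns if col not in ucc]


# Hypothetical toy table, used only for illustration.
rows = [
    {"id": 1, "name": "Ann", "city": "Berlin"},
    {"id": 2, "name": "Bob", "city": "Berlin"},
    {"id": 3, "name": "Ann", "city": "Potsdam"},
]
columns = ["id", "name", "city"]

if is_unique(rows, ("id",)):
    # Inferred from uniqueness alone; holds_fd is only a sanity check here.
    for lhs, rhs in fds_implied_by_ucc(("id",), columns):
        assert holds_fd(rows, lhs, rhs)
```

In this sketch the dependencies with a unique left-hand side are derived without a second pass over the data, which is the general idea behind letting one discovery task prune the search space of another.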
