Relaxed Functional Dependency Discovery in Heterogeneous Data Lakes

Functional dependencies (FDs) express constraints and relationships that every instance of a database must satisfy. Relaxed functional dependencies (RFDs) loosen this requirement and can therefore be used for data exploration and profiling in datasets with lower data quality. In this work, we present an approach for RFD discovery in heterogeneous data lakes. More specifically, the goal is to discover RFDs in structured, semi-structured, and graph data. Our solution is novel in the following aspects: (1) We introduce a generic metamodel to the problem of RFD discovery, which allows us to define and detect RFDs for data stored in heterogeneous sources in an integrated manner. (2) We apply clustering techniques during RFD discovery for partitioning and pruning. (3) We perform an extensive evaluation on nine datasets, which shows that our approach is effective for discovering meaningful RFDs, reducing redundancy, and detecting inconsistent data.
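
To make the notion of a relaxed dependency concrete, the following minimal sketch checks one common kind of relaxation: an approximate FD that tolerates a bounded fraction of violating rows, measured with the g3 error. The abstract does not state which relaxation criteria or error measures the approach uses, so the choice of g3, the function names, the threshold, and the sample rows are all assumptions made for illustration; this is not the paper's discovery algorithm.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """Fraction of rows that must be removed so that lhs -> rhs holds exactly.

    rows: list of dicts; lhs: list of attribute names; rhs: one attribute name.
    (Illustrative helper, not taken from the paper.)
    """
    if not rows:
        return 0.0
    # Group rows by their LHS values and count the RHS values in each group.
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key][row[rhs]] += 1
    # In each group, keep only the rows carrying the most frequent RHS value.
    kept = sum(max(counter.values()) for counter in groups.values())
    return 1 - kept / len(rows)


def holds_approximately(rows, lhs, rhs, threshold):
    """True if lhs -> rhs holds once at most `threshold` of the rows are ignored."""
    return g3_error(rows, lhs, rhs) <= threshold


# Hypothetical sample data: one row spells the city differently.
rows = [
    {"zip": "52062", "city": "Aachen"},
    {"zip": "52062", "city": "Aachen"},
    {"zip": "52062", "city": "Aix-la-Chapelle"},
    {"zip": "50667", "city": "Cologne"},
]

error = g3_error(rows, ["zip"], "city")
print(error)
print(holds_approximately(rows, ["zip"], "city", threshold=0.3))
```

On data like this, the exact FD zip -> city fails because of a single inconsistent row, while the relaxed version still holds under a sufficiently large threshold. The violating row is also a natural candidate when looking for inconsistent data.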
