Discovering Approximate Functional Dependencies from Distributed Big Data

Approximate Functional Dependencies (AFDs) discovered from database relations have proven useful for tasks such as knowledge discovery and query optimization. Previous research has proposed several algorithms for discovering AFDs from a centralized relational database, but none of them is designed to discover AFDs from distributed data. In this paper, we devise a scalable and efficient approach to discovering AFDs from distributed big data that is not constrained by main-memory capacity. To improve the efficiency of AFD discovery, statistics on the local data at each site are first collected and used to filter and prune the set of candidate AFDs. The data is then redistributed and the remaining AFDs are discovered in parallel. We balance the load as evenly as possible before redistribution and prune the candidate set quickly after it. We evaluate our approach on real and synthetic big datasets, and the results show that it is more efficient and scales better as both the size of the relations and the number of nodes increase.
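
To make the idea of combining per-site statistics concrete, the following is a minimal sketch and not the paper's algorithm. It assumes horizontally partitioned data and uses the common g3 error measure, the smallest fraction of tuples that must be removed for a dependency to hold exactly; the abstract does not state which measure the approach uses. The helper names `local_counts` and `g3_error` and the toy `zip`/`city` data are hypothetical.

```python
from collections import Counter


def local_counts(rows, lhs, rhs):
    """Count (lhs-value, rhs-value) pairs in the rows held by one site."""
    return Counter((tuple(row[a] for a in lhs), row[rhs]) for row in rows)


def g3_error(counts):
    """Fraction of tuples to delete so that lhs -> rhs holds exactly (g3)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # For each lhs value, keep only the tuples with its most frequent rhs value.
    best = {}
    for (x, _), n in counts.items():
        best[x] = max(best.get(x, 0), n)
    return (total - sum(best.values())) / total


# Hypothetical toy data: two sites each hold part of the same relation.
site_a = [
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "SF"},
]
site_b = [
    {"zip": "10001", "city": "Newark"},
    {"zip": "94105", "city": "SF"},
]

# Pair counts are additive across sites, so each site can ship a compact
# summary instead of its raw rows.
merged = local_counts(site_a, ["zip"], "city") + local_counts(site_b, ["zip"], "city")
error = g3_error(merged)  # 0.2: one of the five tuples violates zip -> city
```

With an assumed error threshold of, say, 0.25, `zip -> city` would be accepted as an AFD here even though it is not an exact functional dependency. The sketch only illustrates why local statistics can be merged cheaply; the filtering, load balancing, and redistribution steps described above are not modeled.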
