Distributed Discovery of Functional Dependencies

We address the problem of discovering functional dependencies from distributed big data. Existing non-distributed algorithms, such as FastFDs, focus on minimizing computation. Distributed algorithms, however, must also minimize data communication costs, especially in shared-nothing settings. We propose a communication-efficient distributed version of FastFDs and show experimentally that it significantly outperforms a straightforward distributed implementation.
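
For readers unfamiliar with the term, a functional dependency X -> Y holds in a relation when any two rows that agree on the attributes X also agree on the attributes Y. The following is a minimal single-machine sketch of that check only; it is not FastFDs and not the distributed algorithm proposed here, and the function name `holds` and the sample rows are hypothetical, chosen purely for illustration.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    rows is a list of dicts (one per tuple); lhs and rhs are lists of
    attribute names. This is an illustrative check, not a discovery algorithm.
    """
    seen = {}  # maps each observed lhs value combination to its rhs values
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores val on first sight of key, else returns the stored one
        if seen.setdefault(key, val) != val:
            return False  # same lhs values, different rhs values: violation
    return True


# Hypothetical sample relation (assumption, not data from the paper).
rows = [
    {"zip": "10001", "city": "New York", "name": "Ann"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "60601", "city": "Chicago", "name": "Ann"},
]

zip_determines_city = holds(rows, ["zip"], ["city"])
name_determines_city = holds(rows, ["name"], ["city"])
```

In this sample, zip -> city holds while name -> city does not, because the two rows with name "Ann" disagree on city. When the rows are partitioned across nodes, two rows that violate a dependency may reside on different machines, which is one reason communication cost matters in the distributed setting.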
