Distributed Implementations of Dependency Discovery Algorithms

We analyze the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize communication costs, especially in shared-nothing settings, leading to a more complex optimization space. To understand this space, we introduce six primitives shared by existing dependency discovery algorithms, corresponding to data processing steps separated by communication barriers. Through case studies, we show how the primitives allow us to analyze the design space and develop communication-optimized implementations. Finally, we support our analysis with an experimental evaluation on real datasets.

PVLDB Reference Format:
Hemant Saxena, Lukasz Golab, Ihab F. Ilyas. Distributed implementations of dependency discovery algorithms. PVLDB, 12(11): 1624-1636, 2019.
DOI: https://doi.org/10.14778/3342263.3342638
