GDS: General Distributed Strategy for Functional Dependency Discovery Algorithms

Functional dependencies (FDs) are important metadata that describe relationships among the columns of a dataset and are used in tasks such as schema normalization and data cleansing. In modern big data environments, data are partitioned, so single-node FD discovery algorithms are inefficient unless they are parallelized. However, existing parallel distributed algorithms incur heavy communication costs and consequently perform poorly.