Detection of Multiple Function Dependency Violations for Distributed Big Data

Detecting functional dependency violations in a distributed data environment usually requires moving data from one site to another, which makes big data processing inefficient. In this paper, we propose a novel method for detecting violations of multiple functional dependencies based on the concept of equivalence classes, and we provide a response time cost model for the method. Because allocating violation detection tasks in a distributed environment is NP-hard, we formulate the minimization of detection response time as an integer programming problem and provide a near-optimal solution. We provide different task assignment policies for different cluster scales and numbers of functional dependencies, and we also take load balancing into account. Experimental results on real and synthetic data sets show that, compared with centralized detection methods on Hadoop 2.0, the proposed method is clearly more efficient and scales well in big data processing.
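As a rough illustration of the equivalence-class idea, the sketch below groups tuples by their left-hand-side values and flags any class that maps to more than one right-hand-side value. It is not the paper's algorithm: the function names (`local_classes`, `merge_classes`, `violating_classes`), the sample relation, and the choice to exchange per-site class summaries instead of raw tuples are all assumptions made for this example, and task assignment and the cost model are left out.

```python
from collections import defaultdict

def local_classes(rows, lhs, rhs):
    """Summarize one site's rows as {LHS values: set of RHS values} (hypothetical helper)."""
    classes = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        classes[key].add(tuple(row[a] for a in rhs))
    return dict(classes)

def merge_classes(partials):
    """Union per-site summaries; only the summaries travel, not the raw tuples."""
    merged = defaultdict(set)
    for partial in partials:
        for key, values in partial.items():
            merged[key] |= values
    return dict(merged)

def violating_classes(classes):
    """A class violates lhs -> rhs when it holds more than one distinct RHS value."""
    return {key: values for key, values in classes.items() if len(values) > 1}

# Made-up data split across two sites, checked against the FD zip -> city.
site_a = [
    {"zip": "10001", "city": "New York", "state": "NY"},
    {"zip": "94105", "city": "San Francisco", "state": "CA"},
]
site_b = [
    {"zip": "10001", "city": "Newark", "state": "NY"},
]

merged = merge_classes(
    [local_classes(site, ["zip"], ["city"]) for site in (site_a, site_b)]
)
violations = violating_classes(merged)
```

In this toy data neither site sees a violation of `zip -> city` on its own; the conflict appears only after the per-site summaries are merged, which is why detection in a distributed setting requires some communication between sites.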
