Mining CFD Rules on Big Data

Existing discovery algorithms for conditional functional dependencies (CFDs) require a well-prepared training data set. This makes them difficult to apply to large datasets, which are typically of low quality. To handle the volume of big data, we develop sampling algorithms that obtain a small, representative training set. To handle its low quality, we design a fault-tolerant rule discovery algorithm and a conflict resolution algorithm. We also propose a parameter selection strategy for the CFD discovery algorithm to ensure its effectiveness. Experimental results demonstrate that our method can discover effective CFD rules on billion-tuple data within a reasonable time.
