Effective Pruning for the Discovery of Conditional Functional Dependencies

Conditional functional dependencies (CFDs) have been proposed as a new type of semantic rule that extends traditional functional dependencies. They have shown great potential for detecting and repairing inconsistent data. Constant CFDs are association rules with 100% confidence. The theoretical search space for the minimal set of CFDs is the set of minimal generators and their closures in the data. This search space is used by the most efficient constant CFD discovery algorithm to date. In this paper, we propose pruning criteria that further reduce the theoretical search space, and we design a fast algorithm for constant CFD discovery. We evaluate the proposed algorithm on a number of medium to large real-world data sets. The proposed algorithm is faster than the most efficient existing constant CFD discovery algorithm, and its running time is linear in the size of the data set.
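
To make the link between constant CFDs, 100% confidence rules, and closures concrete, the following is a minimal sketch, not the algorithm proposed in the paper. The toy relation and the helper names (`to_items`, `support`, `closure`, `confidence`) are assumptions introduced here for illustration only. Each tuple is treated as a set of (attribute, value) items; the closure of a pattern is the set of items shared by every tuple that matches it, and any item in that closure is implied by the pattern with 100% confidence.

```python
from typing import Dict, FrozenSet, List, Tuple

Item = Tuple[str, str]          # an (attribute, value) pair
Row = FrozenSet[Item]           # one tuple of the relation, as a set of items


def to_items(row: Dict[str, str]) -> Row:
    """Encode a tuple as a set of (attribute, value) items."""
    return frozenset(row.items())


def support(pattern: FrozenSet[Item], rows: List[Row]) -> List[Row]:
    """Return the tuples that contain every item of the pattern."""
    return [r for r in rows if pattern <= r]


def closure(pattern: FrozenSet[Item], rows: List[Row]) -> FrozenSet[Item]:
    """Items shared by all tuples matching the pattern (the pattern itself if none match)."""
    matching = support(pattern, rows)
    if not matching:
        return frozenset(pattern)
    return frozenset.intersection(*matching)


def confidence(lhs: FrozenSet[Item], rhs: Item, rows: List[Row]) -> float:
    """Fraction of tuples matching lhs that also contain rhs."""
    matching = support(lhs, rows)
    if not matching:
        return 0.0
    return sum(1 for r in matching if rhs in r) / len(matching)


# Hypothetical toy relation, invented for this example.
ROWS = [to_items(r) for r in [
    {"country": "UK", "area": "131", "city": "Edinburgh"},
    {"country": "UK", "area": "131", "city": "Edinburgh"},
    {"country": "UK", "area": "020", "city": "London"},
    {"country": "US", "area": "131", "city": "Springfield"},
]]

LHS = frozenset({("country", "UK"), ("area", "131")})
RHS = ("city", "Edinburgh")

if __name__ == "__main__":
    # The rule (country=UK, area=131) -> city=Edinburgh holds in every matching tuple.
    print(confidence(LHS, RHS, ROWS))
    # The closure of the left-hand side contains the right-hand side item.
    print(sorted(closure(LHS, ROWS)))
    # Dropping country=UK breaks the rule: area=131 alone does not determine the city.
    print(confidence(frozenset({("area", "131")}), RHS, ROWS))
```

In this toy relation the pattern (country=UK, area=131) implies city=Edinburgh with confidence 1.0, while area=131 alone does not, so the two-item left-hand side cannot be shortened by dropping country=UK. A discovery algorithm has to find such left-hand sides without enumerating every pattern, which is where pruning of the search space matters.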
