An Efficient and Robust Approach for Discovering Data Quality Rules

Poor quality data is a growing problem that affects many enterprises across all aspects of their business ranging from operational efficiency to revenue protection. Moreover, this problem is costly to fix because significant effort and resources are required to identify a comprehensive set of rules that can detect (and correct) data defects along various data quality dimensions such as consistency, conformity, and more. Hence, many organizations employ only basic data quality rules that check for null values, format, etc. in efforts such as data profiling and data cleansing; and ignore rules that are needed to detect deeper problems such as inconsistent values across interdependent attributes. This oversight can lead to numerous problems such as inaccurate reporting of key metrics used to inform critical decisions or derive business insights. In this paper, we present an approach that efficiently and robustly discovers data quality rules -- in particular conditional functional dependencies -- for detecting inconsistencies in data and hence improves data quality along the critical dimension of consistency. We evaluate our approach empirically on several real-world data sets. We show that our approach performs well on these data sets for metrics such as precision and recall. We also compare our approach to an established solution and show that our approach outperforms this solution for the same metrics. Finally, we show that our approach scales efficiently with the number of records, the number of attributes, and the domain size.

[1]  Laks V. S. Lakshmanan,et al.  Discovering Conditional Functional Dependencies , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[2]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[3]  Shuai Ma,et al.  Improving Data Quality: Consistency and Accuracy , 2007, VLDB.

[4]  Renée J. Miller,et al.  Discovering data quality rules , 2008, Proc. VLDB Endow..

[5]  Hannu Toivonen,et al.  Efficient discovery of functional and approximate dependencies using partitions , 1998, Proceedings 14th International Conference on Data Engineering.

[6]  Bei Yu,et al.  On generating near-optimal tableaux for conditional functional dependencies , 2008, Proc. VLDB Endow..

[7]  Rajeev Motwani,et al.  Dynamic itemset counting and implication rules for market basket data , 1997, SIGMOD '97.

[8]  Wenfei Fan,et al.  Conditional Functional Dependencies for Data Cleaning , 2007, 2007 IEEE 23rd International Conference on Data Engineering.