Towards certain fixes with editing rules and master data

A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are guaranteed correct, and worse still, may even introduce new errors when attempting to repair the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We also develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. Furthermore, we present a framework and an algorithm to find certain fixes, by interacting with the users to ensure that one of the certain regions is correct. We experimentally verify the effectiveness and scalability of the algorithm.

[1]  Shuai Ma,et al.  Increasing the Expressivity of Conditional Functional Dependencies without Extra Complexity , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[2]  Salil P. Vadhan,et al.  Computational Complexity , 2005, Encyclopedia of Cryptography and Security.

[3]  D. Holt,et al.  A Systematic Approach to Automatic Edit and Imputation , 1976 .

[4]  Laks V. S. Lakshmanan,et al.  On approximating optimum repairs for functional dependency violations , 2009, ICDT '09.

[5]  Vijay V. Vazirani,et al.  Approximation Algorithms , 2001, Springer Berlin Heidelberg.

[6]  Jan Chomicki,et al.  Minimal-change integrity maintenance using tuple deletions , 2002, Inf. Comput..

[7]  Jennifer Widom,et al.  Active Database Systems: Triggers and Rules For Advanced Database Processing , 1994 .

[8]  Carlo Batini,et al.  Data Quality: Concepts, Methodologies and Techniques , 2006, Data-Centric Systems and Applications.

[9]  Ran Raz,et al.  A sub-constant error-probability low-degree test, and a sub-constant error-probability PCP characterization of NP , 1997, STOC '97.

[10]  William E. Winkler,et al.  Data quality and record linkage techniques , 2007 .

[11]  Thomas Redman,et al.  The impact of poor data quality on the typical enterprise , 1998, CACM.

[12]  Jef Wijsen,et al.  Database repairing using updates , 2005, TODS.

[13]  Wenfei Fan,et al.  Dependencies revisited for improving data quality , 2008, PODS.

[14]  Serge Abiteboul,et al.  Foundations of Databases , 1994 .

[15]  Jan Chomicki,et al.  Consistent query answers in inconsistent databases , 1999, PODS '99.

[16]  Mihalis Yannakakis,et al.  On Generating All Maximal Independent Sets , 1988, Inf. Process. Lett..

[17]  L. Venkata Subramaniam,et al.  Data cleansing as a transient service , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[18]  Bei Yu,et al.  On generating near-optimal tableaux for conditional functional dependencies , 2008, Proc. VLDB Endow..

[19]  Joseph M. Hellerstein,et al.  Potter's Wheel: An Interactive Data Cleaning System , 2001, VLDB.

[20]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[21]  Wenfei Fan,et al.  Conditional functional dependencies for capturing data inconsistencies , 2008, TODS.

[22]  Wenguang Chen,et al.  Analyses and Validation of Conditional Dependencies with Built-in Predicates , 2009, DEXA.

[23]  Shuai Ma,et al.  Extending Dependencies with Conditions , 2007, VLDB.

[24]  Tatsuya Akutsu,et al.  Approximating Minimum Keys and Optimal Substructure Screens , 1996, COCOON.

[25]  Leslie G. Valiant,et al.  The Complexity of Enumeration and Reliability Problems , 1979, SIAM J. Comput..

[26]  Jennifer Widom,et al.  Swoosh: a generic approach to entity resolution , 2008, The VLDB Journal.

[27]  Joseph M. Hellerstein,et al.  USHER: Improving data quality with dynamic forms , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[28]  Shuai Ma,et al.  Improving Data Quality: Consistency and Accuracy , 2007, VLDB.

[29]  Jianzhong Li,et al.  The VLDB Journal manuscript No. (will be inserted by the editor) Dynamic Constraints for Record Matching , 2022 .

[30]  Rajeev Rastogi,et al.  A cost-based model and effective heuristic for repairing constraints by value modification , 2005, SIGMOD '05.

[31]  Jianzhong Li,et al.  Reasoning about Record Matching Rules , 2009, Proc. VLDB Endow..

[32]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[33]  Felix Naumann,et al.  Data Fusion in Three Steps: Resolving Schema, Tuple, and Value Inconsistencies , 2006, IEEE Data Eng. Bull..

[34]  Divesh Srivastava,et al.  Record linkage with uniqueness constraints and erroneous values , 2010, Proc. VLDB Endow..

[35]  Lei Chen,et al.  Discovering matching dependencies , 2009, CIKM.

[36]  Philip Giles,et al.  A model for generalized edit and imputation of survey data , 1988 .

[37]  Sanjeev Arora,et al.  Computational Complexity: A Modern Approach , 2009 .

[38]  Ahmed K. Elmagarmid,et al.  Guided data repair , 2011, Proc. VLDB Endow..

[39]  Ahmed K. Elmagarmid,et al.  Duplicate Record Detection: A Survey , 2007, IEEE Transactions on Knowledge and Data Engineering.

[40]  Kazuhisa Makino,et al.  New Algorithms for Enumerating All Maximal Cliques , 2004, SWAT.

[41]  Renée J. Miller,et al.  Discovering data quality rules , 2008, Proc. VLDB Endow..

[42]  Jianzhong Li,et al.  Towards certain fixes with editing rules and master data , 2010, Proc. VLDB Endow..

[43]  Rajeev Motwani,et al.  Robust and efficient fuzzy match for online data cleaning , 2003, SIGMOD '03.