Dependable Data Repairing with Fixing Rules

One of the main challenges that data-cleaning systems face is to automatically identify and repair data errors in a dependable manner. Although data dependencies (also known as integrity constraints) have been widely studied as a means of capturing errors in data, automated and dependable repair of these errors has remained a notoriously difficult problem. In this work, we introduce an automated approach for dependably repairing data errors, based on a novel class of fixing rules. A fixing rule contains an evidence pattern, a set of negative patterns, and a fact value. Fixing rules are deterministic at their core: given a tuple, the evidence pattern and the negative patterns of a fixing rule are combined to capture precisely which attribute is wrong, and the fact indicates how to correct this error. We study several fundamental problems associated with fixing rules and establish their complexity. We develop efficient algorithms to check whether a set of fixing rules is consistent, and we discuss approaches to resolving inconsistent fixing rules. We also devise efficient algorithms for repairing data errors using fixing rules. Moreover, we discuss approaches to generating a large number of fixing rules from examples or available knowledge bases. Using both real-life and synthetic data, we experimentally demonstrate that our techniques outperform other automated algorithms in terms of the accuracy of repairing data errors.
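
To make the three components of a fixing rule concrete, the following is a minimal Python sketch of how a single rule might be matched against a tuple and applied. It is an illustration under stated assumptions, not the algorithm from the paper: the class name `FixingRule`, its field names, the dictionary representation of a tuple, and the country/capital sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass
class FixingRule:
    """Hypothetical representation of one fixing rule (names are assumptions)."""
    evidence: Dict[str, str]   # evidence pattern: attribute -> required value
    target: str                # attribute that the rule may correct
    negatives: FrozenSet[str]  # negative patterns: known-wrong values of target
    fact: str                  # fact value: the correct value of target

    def matches(self, row: Dict[str, str]) -> bool:
        # The rule fires only if the evidence pattern holds AND the target
        # attribute currently carries one of the negative patterns.
        evidence_holds = all(row.get(a) == v for a, v in self.evidence.items())
        return evidence_holds and row.get(self.target) in self.negatives

    def apply(self, row: Dict[str, str]) -> Dict[str, str]:
        # Return a repaired copy; rows that do not match are returned unchanged.
        fixed = dict(row)
        if self.matches(row):
            fixed[self.target] = self.fact
        return fixed


# Illustrative rule (sample values are assumptions): if country is "China" and
# capital is a known-wrong value, set capital to "Beijing".
rule = FixingRule(
    evidence={"country": "China"},
    target="capital",
    negatives=frozenset({"Shanghai", "Hongkong"}),
    fact="Beijing",
)

wrong = {"name": "Ann", "country": "China", "capital": "Shanghai"}
unknown = {"name": "Bob", "country": "China", "capital": "Nanjing"}
other = {"name": "Cid", "country": "Canada", "capital": "Shanghai"}

repaired = rule.apply(wrong)
```

In this sketch, a value that is suspicious but not listed among the negative patterns (such as the `unknown` row) is left untouched. That conservative behavior is one way to read the deterministic character described above: a rule changes a value only when both the evidence and a negative pattern match.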
