Improving Data Quality: Consistency and Accuracy

Two central criteria for data quality are consistency and accuracy. Inconsistencies and errors in a database often emerge as violations of integrity constraints. Given a dirty database D, one needs automated methods to make it consistent, i.e., to find a repair D' that satisfies the constraints and "minimally" differs from D. Equally important is to ensure that the automatically generated repair D' is accurate, or makes sense, i.e., that D' differs from the "correct" data within a predefined bound. This paper studies effective methods for improving both data consistency and accuracy. To specify the consistency of the data, we employ a class of conditional functional dependencies (CFDs) proposed in [6], which capture inconsistencies and errors beyond what their traditional counterparts can catch. To improve the consistency of the data, we propose two algorithms: one for automatically computing a repair D' that satisfies a given set of CFDs, and the other for incrementally finding a repair in response to updates to a clean database. We show that both problems are intractable. Although our algorithms are necessarily heuristic, we experimentally verify that the methods are effective and efficient. Moreover, we develop a statistical method that guarantees that the repairs found by the algorithms are accurate above a predefined rate without incurring excessive user interaction.
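To make the notion of a CFD concrete, the following minimal sketch detects tuples that violate a single CFD. It only illustrates violation detection and is not the repair algorithm of the paper. The function name cfd_violations, the attribute names (cc, zip, street), the sample rows, and the use of "_" as a wildcard are assumptions made for this example. A CFD is modeled here as an ordinary functional dependency lhs -> rhs plus a pattern that restricts which tuples it applies to.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed notation: matches any value


def matches(value, pattern):
    """A value matches a pattern cell if the cell is a wildcard or equal to it."""
    return pattern == WILDCARD or value == pattern


def cfd_violations(rows, lhs, rhs, pattern):
    """Return sorted indices of rows that violate the CFD (lhs -> rhs, pattern).

    rows    : list of dicts, one per tuple
    lhs     : list of left-hand-side attribute names
    rhs     : single right-hand-side attribute name
    pattern : dict mapping every attribute in lhs + [rhs] to a constant or "_"
    """
    violating = set()
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        # The dependency only constrains tuples matching the LHS pattern.
        if not all(matches(row[a], pattern[a]) for a in lhs):
            continue
        # A single tuple can violate a CFD if the RHS pattern is a constant.
        if not matches(row[rhs], pattern[rhs]):
            violating.add(i)
        groups[tuple(row[a] for a in lhs)].append(i)
    # Tuples that agree on the LHS must also agree on the RHS.
    for members in groups.values():
        if len({rows[i][rhs] for i in members}) > 1:
            violating.update(members)
    return sorted(violating)


# Hypothetical sample data.
rows = [
    {"cc": "44", "zip": "EH4 1DT", "street": "Mayfield"},
    {"cc": "44", "zip": "EH4 1DT", "street": "Crichton"},
    {"cc": "01", "zip": "07974", "street": "Mtn Ave"},
    {"cc": "01", "zip": "07974", "street": "Main St"},
]

# "Within country code 44, zip determines street."
uk_only = {"cc": "44", "zip": "_", "street": "_"}
print(cfd_violations(rows, ["cc", "zip"], "street", uk_only))  # [0, 1]
```

With an all-wildcard pattern the check degenerates to a traditional functional dependency and would flag all four sample rows, which shows how the pattern narrows the constraint to the part of the data where it is meant to hold.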

[1] Leopoldo E. Bertossi, et al. Complexity of Consistent Query Answering in Databases Under Cardinality-Based and Incremental Repair Semantics, 2006, ICDT.

[2] Pierre Marquis, et al. DISTANCE-SAT: Complexity and Algorithms, 1999, AAAI/IAAI.

[3] Pradeep Ravikumar, et al. A Comparison of String Distance Metrics for Name-Matching Tasks, 2003, IIWeb.

[4] Noga Alon, et al. The Probabilistic Method, 2015, Fundamentals of Ramsey Theory.

[5] D. Holt, et al. A Systematic Approach to Automatic Edit and Imputation, 1976.

[6] Jeffrey Scott Vitter, et al. Random sampling with a reservoir, 1985, TOMS.

[7] Michael J. Maher. Constrained Dependencies, 1995, Theor. Comput. Sci.

[8] Wenfei Fan, et al. Conditional Functional Dependencies for Data Cleaning, 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[9] Jan Chomicki, et al. Minimal-change integrity maintenance using tuple deletions, 2002, Inf. Comput.

[10] Tomasz Imielinski, et al. Incomplete Information in Relational Databases, 1984, JACM.

[11] Dennis Shasha, et al. Declarative Data Cleaning: Language, Model, and Algorithms, 2001, VLDB.

[12] David S. Johnson, et al. Computers and Intractability: A Guide to the Theory of NP-Completeness, 1978.

[13] Loreto Bravo, et al. Efficient Approximation Algorithms for Repairing Inconsistent Databases, 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[14] Jaikumar Radhakrishnan, et al. Greed is good: Approximating independent sets in sparse and bounded-degree graphs, 1997, Algorithmica.

[15] Ravi B. Boppana, et al. Approximating maximum independent sets by excluding subgraphs, 1990, BIT.

[16] Jan Chomicki, et al. Consistent query answers in inconsistent databases, 1999, PODS '99.

[17] Timos K. Sellis, et al. ARKTOS: towards the modeling, design, control and execution of ETL processes, 2001, Inf. Syst.

[18] Paul De Bra, et al. Conditional Dependencies for Horizontal Decompositions, 1983, ICALP.

[19] Marianne Baudinet, et al. Constraint-Generating Dependencies, 1994, PPCP.

[20] Gösta Grahne, et al. The Problem of Incomplete Information in Relational Databases, 1991, Lecture Notes in Computer Science.

[21] Salvatore J. Stolfo, et al. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem, 1998, Data Mining and Knowledge Discovery.

[22] Francesco Scarcello, et al. Census Data Repair: a Challenging Application of Disjunctive Logic Programming, 2001, LPAR.

[23] Jef Wijsen, et al. Condensed Representation of Database Repairs for Consistent Query Answering, 2003, ICDT.

[24] Ravi B. Boppana, et al. Approximating maximum independent sets by excluding subgraphs, 1992, BIT Comput. Sci. Sect.

[25] Thomas Redman, et al. The impact of poor data quality on the typical enterprise, 1998, CACM.

[26] Dennis Shasha, et al. AJAX: an extensible data cleaning tool, 2000, SIGMOD '00.

[27] Rajeev Rastogi, et al. A cost-based model and effective heuristic for repairing constraints by value modification, 2005, SIGMOD '05.

[28] Joseph M. Hellerstein, et al. Potter's Wheel: An Interactive Data Cleaning System, 2001, VLDB.

[29] Petra Perner, et al. Data Mining - Concepts and Techniques, 2002, Künstliche Intell.

[30] Michael J. Maher, et al. Chasing constrained tuple-generating dependencies, 1996, PODS.

[31] William E. Winkler, et al. Methods for evaluating and creating data quality, 2004, Inf. Syst.

[32] Erhard Rahm, et al. Data Cleaning: Problems and Current Approaches, 2000, IEEE Data Eng. Bull.

[33] Antonio Sassano, et al. Errors Detection and Correction in Large Scale Data Collecting, 2001, IDA.