Explaining Repaired Data with CFDs

Many popular data cleaning approaches are rule-based: constraints are formulated in a logical framework, and data is considered dirty if it violates them. These constraints are often discovered from the data, but user verification is necessary to ascertain their validity. Since the full set of discovered constraints is typically too large for manual inspection, recent research integrates user feedback into the discovery process. We propose a different approach that employs user interaction only at the start of the algorithm: a user manually cleans a small set of dirty tuples, and we infer the constraint underlying those repairs, called an explanation. We use conditional functional dependencies (CFDs) as the constraint formalism. We introduce XPlode, an on-demand algorithm that discovers the best explanation for a given repair. Guided by this explanation, the data can then be cleaned using state-of-the-art CFD-based cleaning algorithms. Experiments on synthetic and real-world datasets show that the best explanation can typically be inferred from a limited number of modifications. Moreover, XPlode is substantially faster than discovering all CFDs that hold on a dataset, and it is robust to noise in the modifications.
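
To make the notion of a CFD and of an explanation concrete, the following is a minimal sketch under simplifying assumptions, not the XPlode algorithm itself. It assumes a CFD is given as a left-hand-side pattern (attribute to constant, or a wildcard `_`), a right-hand-side attribute, and an optional right-hand-side constant. The function names `matches`, `violations`, and `explains`, the acceptance criterion in `explains`, and the toy phone-code data are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed notation for "any value" in a pattern


def matches(row, pattern):
    """True if the row agrees with every constant in the pattern."""
    return all(v == WILDCARD or row[a] == v for a, v in pattern.items())


def violations(rows, lhs_pattern, rhs_attr, rhs_value=WILDCARD):
    """Indices of rows involved in a violation of the CFD
    (lhs_pattern -> rhs_attr = rhs_value)."""
    bad = set()
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        if not matches(row, lhs_pattern):
            continue  # the CFD only constrains rows matching its pattern
        if rhs_value != WILDCARD:
            # Constant right-hand side: a single row can violate the CFD.
            if row[rhs_attr] != rhs_value:
                bad.add(i)
        else:
            # Variable right-hand side: group rows by their LHS values.
            key = tuple(row[a] for a in sorted(lhs_pattern))
            groups[key].append(i)
    for members in groups.values():
        # Rows agreeing on the LHS must agree on the RHS.
        if len({rows[i][rhs_attr] for i in members}) > 1:
            bad.update(members)
    return sorted(bad)


def explains(dirty, repaired, lhs_pattern, rhs_attr, rhs_value=WILDCARD):
    """Assumed criterion: the CFD is violated before the user's repair
    and satisfied after it."""
    before = violations(dirty, lhs_pattern, rhs_attr, rhs_value)
    after = violations(repaired, lhs_pattern, rhs_attr, rhs_value)
    return bool(before) and not after


# Toy data: the user changes the city of the second tuple.
dirty = [
    {"country": "44", "area": "131", "city": "Edinburgh"},
    {"country": "44", "area": "131", "city": "London"},
    {"country": "01", "area": "908", "city": "Murray Hill"},
]
repaired = [dict(r) for r in dirty]
repaired[1]["city"] = "Edinburgh"

# Candidate explanation: within country 44, the area code determines the city.
cfd_lhs = {"country": "44", "area": WILDCARD}
print(violations(dirty, cfd_lhs, "city"))          # [0, 1]
print(explains(dirty, repaired, cfd_lhs, "city"))  # True
```

In this sketch, a candidate CFD counts as an explanation only if the user's modifications turn a violated constraint into a satisfied one. Ranking candidates to find the best explanation, and tolerating noisy modifications, are the parts the paper addresses and are not modeled here.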
