SCODED: Statistical Constraint Oriented Data Error Detection

Statistical Constraints (SCs) play an important role in statistical modeling and analysis. This paper brings the concept to data cleaning and studies how to leverage SCs for error detection. SCs offer a novel approach to error detection that applies to a variety of scenarios and works harmoniously with downstream statistical modeling. Entailment relationships between SCs and integrity constraints provide analytical insight into SCs. We develop SCODED, an SC-Oriented Data Error Detection system comprising two key components: (1) SC Violation Detection, which checks whether an SC is violated on a given dataset, and (2) Error Drill Down, which identifies the top-k records that contribute most to the violation of an SC. Experiments on synthetic and real-world data show that, compared to state-of-the-art approaches, SCs are effective in detecting the data errors that violate them.
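The two components can be illustrated with a minimal sketch. It is not the SCODED implementation: it assumes the SC is an independence constraint between two categorical columns, uses the Pearson chi-squared statistic as the violation test, and ranks records by the standardized residual of the cell they fall in as a simple stand-in for drill-down. The function names, the toy data, and the critical value passed by the caller are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def chi2_statistic(rows):
    """Pearson chi-squared statistic for independence of the two columns."""
    n = len(rows)
    if n == 0:
        return 0.0
    joint = Counter(rows)
    left = Counter(a for a, _ in rows)
    right = Counter(b for _, b in rows)
    stat = 0.0
    for a in left:
        for b in right:
            expected = left[a] * right[b] / n
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def violates_independence(rows, critical_value):
    """Treat the SC 'column 1 is independent of column 2' as violated when
    the statistic exceeds a critical value chosen by the caller."""
    return chi2_statistic(rows) > critical_value


def drill_down(rows, k):
    """Return indices of the k records lying in the most over-represented
    cells, scored by standardized residual (an illustrative heuristic)."""
    n = len(rows)
    joint = Counter(rows)
    left = Counter(a for a, _ in rows)
    right = Counter(b for _, b in rows)

    def score(i):
        a, b = rows[i]
        expected = left[a] * right[b] / n
        return (joint[(a, b)] - expected) / sqrt(expected)

    # sorted() is stable, so ties keep their original record order.
    return sorted(range(n), key=lambda i: -score(i))[:k]


# Toy data: 20 balanced records, then 4 suspicious records appended.
clean = [(a, b) for a in ("a", "b") for b in ("x", "y")] * 5
dirty = clean + [("c", "z")] * 4

# 9.488 is the 0.05 critical value of chi-squared with 4 degrees of
# freedom, matching the 3x3 contingency table of the dirty data.
if violates_independence(dirty, critical_value=9.488):
    suspects = drill_down(dirty, k=4)
    print("suspect record indices:", suspects)
```

Scoring every record once keeps the sketch short; a greedy variant that recomputes the statistic after each removal would track each record's contribution to the violation more closely, at a higher cost.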
