Qualitative Cleaning of Uncertain Data

We propose a new view on data cleaning: it is not the data itself that is dirty, but the degrees of uncertainty attributed to the data. Applying possibility theory, tuples are assigned degrees of possibility with which they occur, and constraints are assigned degrees of certainty that say to which tuples they apply. Classical data cleaning modifies some minimal set of tuples. Instead, we marginally reduce their degrees of possibility. This reduction leads to a new qualitative version of the vertex cover problem. Qualitative vertex cover can be mapped to a linear-weighted constraint satisfaction problem. However, no off-the-shelf solver can solve this problem more efficiently than classical vertex cover. Instead, we utilize the degrees of possibility and certainty to develop a dedicated algorithm that is fixed-parameter tractable in the size of the qualitative vertex cover. Experiments show that our algorithm is faster than solvers for the classical vertex cover problem by several orders of magnitude, and that its performance improves as the number of uncertainty degrees grows.
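
For orientation, the classical baseline mentioned above can be sketched as follows. This is the textbook bounded-search-tree algorithm for classical vertex cover, which is fixed-parameter tractable in the cover size k; it is not the dedicated qualitative algorithm, whose details the abstract does not give. The conflict graph, the tuple names `t1` to `t5`, and the function names are assumptions made for illustration: each edge stands for a pair of tuples that jointly violate a constraint.

```python
from typing import List, Optional, Set, Tuple

Edge = Tuple[str, str]  # a pair of tuple identifiers that conflict (hypothetical)


def vertex_cover(edges: List[Edge], k: int) -> Optional[Set[str]]:
    """Return a vertex cover with at most k vertices, or None if none exists.

    Classical bounded search tree: every edge needs one of its endpoints
    in the cover, so branch on the two endpoints of an uncovered edge.
    The search tree has depth at most k, hence at most 2^k leaves.
    """
    if not edges:
        return set()          # nothing left to cover
    if k == 0:
        return None           # edges remain but the budget is spent
    u, v = edges[0]
    for pick in (u, v):
        # Choosing `pick` covers every edge that touches it.
        remaining = [e for e in edges if pick not in e]
        sub = vertex_cover(remaining, k - 1)
        if sub is not None:
            return sub | {pick}
    return None


def minimum_vertex_cover(edges: List[Edge]) -> Set[str]:
    """Find a smallest cover by trying k = 0, 1, 2, ... in turn."""
    k = 0
    while True:
        cover = vertex_cover(edges, k)
        if cover is not None:
            return cover
        k += 1


# Hypothetical conflict graph over five tuples.
conflicts = [("t1", "t2"), ("t1", "t3"), ("t4", "t5")]
cover = minimum_vertex_cover(conflicts)
print(sorted(cover))
```

In the classical setting, the tuples in the cover are the ones a repair would modify or remove. The approach described in the abstract instead lowers their degrees of possibility, which is what turns the problem into its qualitative variant.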
