The Complexity of Database Inconsistency Measures

Managing data inconsistency has been one of the major challenges in the research and practice of database management. Database inconsistency arises for different reasons and in different applications. Nowadays, many applications obtain information from imprecise sources (e.g., social networks) via imprecise procedures (e.g., natural-language processing). Inconsistency may also arise when integrating conflicting data from different sources. During the past two decades, researchers have established, developed and investigated a principled approach to managing database inconsistency via the notion of database repairs. A repair of an inconsistent database is traditionally defined as a consistent database that differs from the inconsistent one in a “minimal” way. We investigate various problems that arise in measuring how inconsistent a database is. The problem of measuring inconsistency has been studied extensively by the Knowledge Representation and Logic communities, and has recently been acknowledged by the database community. Inconsistency measures are important for estimating the extent to which a database is trustworthy, and the effort required to clean it. Specifically, we explore the computational complexity of two basic inconsistency measures. The first measure is based on the cost of a minimal repair (i.e., the minimal number of operations required to obtain consistency), and the second is based on the number of repairs. We focus on data complexity (where the schema is considered fixed and the input consists of a database instance) and establish dichotomies in data complexity (i.e., a full classification) for the entire space of sets of functional dependencies. Finally, repairs are often not equally legitimate, since one repair may be preferred over another; for example, one tuple may be regarded as more reliable than another, or a more recent tuple may be preferred to an earlier one. Motivated by these considerations, researchers have introduced and investigated the framework of preferred repairs, which incorporates preferences among database tuples. We revisit the second measure in the presence of preferences among tuples in the database. We show that the presence of preferences significantly affects the computational complexity.
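
As a rough illustration of the two measures, the following brute-force sketch computes both on a toy relation. It rests on assumptions that the abstract does not fix: repairs are taken to be subset repairs (maximal consistent subsets obtained by deleting tuples), each deletion costs one operation, and preferences among tuples are ignored. The names `violates`, `is_consistent`, and `inconsistency_measures`, as well as the sample data, are hypothetical. The enumeration is exponential in the number of tuples and is meant only to make the definitions concrete, not to reflect the algorithms or complexity results of the paper.

```python
from itertools import combinations

def violates(t1, t2, fds):
    """True if t1 and t2 agree on the left side of some FD but differ on its right side."""
    return any(
        all(t1[a] == t2[a] for a in lhs) and any(t1[a] != t2[a] for a in rhs)
        for lhs, rhs in fds
    )

def is_consistent(rows, fds):
    """A set of tuples satisfies the FDs iff no pair of tuples violates one."""
    return not any(violates(a, b, fds) for a, b in combinations(rows, 2))

def inconsistency_measures(rows, fds):
    """Return (minimal number of deletions, number of subset repairs).

    Assumption: only tuple deletions are allowed, so a repair is a
    maximal consistent subset of the input tuples.
    """
    n = len(rows)
    # Enumerate every consistent subset of tuple indices (exponential in n).
    consistent = [
        frozenset(idx)
        for k in range(n + 1)
        for idx in combinations(range(n), k)
        if is_consistent([rows[i] for i in idx], fds)
    ]
    # A subset repair is a consistent subset with no consistent strict superset.
    repairs = [s for s in consistent if not any(s < other for other in consistent)]
    # The cheapest repair keeps as many tuples as possible.
    min_cost = n - max(len(s) for s in consistent)
    return min_cost, len(repairs)

# Hypothetical toy instance with the single FD name -> city.
rows = [
    {"name": "ann", "city": "Paris"},
    {"name": "ann", "city": "Rome"},
    {"name": "bob", "city": "Oslo"},
    {"name": "bob", "city": "Lima"},
    {"name": "cat", "city": "Kyiv"},
]
fds = [(("name",), ("city",))]

print(inconsistency_measures(rows, fds))
```

In this instance the two `ann` tuples conflict, as do the two `bob` tuples, so at least two deletions are needed and there are four ways to choose which tuple of each conflicting pair to keep.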
