Completeness and soundness guarantees for conjunctive SPARQL queries over RDF data sources with completeness statements

RDF generally follows the open-world assumption: information is incomplete by default. Consequently, SPARQL queries cannot be guaranteed to return complete answers and, even worse, when they involve negation, it is unclear whether their answers are sound. Nevertheless, there is hope of lifting this limitation. On many specific topics (e.g., children of Trump, Apollo 11 crew, EU founders), RDF data sources contain complete information, a fact that can be made explicit through completeness statements. In this work, we leverage completeness statements over RDF data sources to provide guarantees of completeness and soundness for conjunctive SPARQL queries. We develop a technique for checking whether query completeness can be guaranteed that also takes into account the specifics of the queried graph, and we analyze the complexity of this check. For queries with negation, we address the problem of checking query soundness, distinguishing between answer soundness (i.e., is a given answer of a query sound?) and pattern soundness (i.e., is the query as a whole sound?). We provide a formalization and characterize the soundness problem via a reduction to the completeness problem. We further develop heuristic techniques for completeness checking and conduct experimental evaluations based on Wikidata, a prominent real-world knowledge base, to demonstrate the feasibility of our approach.
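
To make the idea of a completeness statement concrete, the following is a minimal sketch and not the technique developed in this work. It assumes a heavily simplified setting: each completeness statement is a single triple pattern without a condition, and the check ignores the contents of the queried graph, whereas the approach described above takes them into account. All names (`covers`, `query_is_complete`, the `ex:` terms) are hypothetical and chosen for illustration only.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]


def is_var(term: str) -> bool:
    """Terms starting with '?' are variables, as in SPARQL."""
    return term.startswith("?")


def covers(statement: Triple, query_triple: Triple) -> bool:
    """Return True if the statement pattern subsumes the query triple.

    Query variables are treated as frozen constants (assumption of this
    sketch): only variables of the statement may be bound, and a repeated
    statement variable must be bound consistently.
    """
    binding: Dict[str, str] = {}
    for s_term, q_term in zip(statement, query_triple):
        if is_var(s_term):
            if binding.setdefault(s_term, q_term) != q_term:
                return False
        elif s_term != q_term:
            return False
    return True


def query_is_complete(query: Iterable[Triple],
                      statements: Iterable[Triple]) -> bool:
    """A conjunctive query is guaranteed complete in this simplified
    setting if every one of its triple patterns is covered by a statement."""
    statements = list(statements)
    return all(any(covers(st, tp) for st in statements) for tp in query)


# Hypothetical statement: the source is complete for the children of Trump.
statements = [("ex:trump", "ex:child", "?c")]

q1 = [("ex:trump", "ex:child", "?x")]
q2 = [("ex:trump", "ex:child", "?x"), ("?x", "ex:spouse", "?y")]

print(query_is_complete(q1, statements))  # True
print(query_is_complete(q2, statements))  # False
```

In this sketch the second query is not guaranteed complete because no statement covers the spouse pattern. A check that also inspects the queried graph, as described above, could in principle conclude completeness in more cases.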
