Foundations of SPARQL query optimization

We study fundamental aspects of the efficient processing of SPARQL, the query language for RDF, the data format proposed by the W3C to encode machine-readable information in the Semantic Web. Our key contributions are (i) a complete complexity analysis for all operator fragments of the SPARQL query language, which, as a central result, shows that the SPARQL operator Optional alone is responsible for the PSPACE-completeness of the evaluation problem, (ii) a study of equivalences over SPARQL algebra, covering both rewriting rules such as filter and projection pushing, which are well known from relational algebra optimization, and SPARQL-specific rewriting schemes, and (iii) an approach to the semantic optimization of SPARQL queries, built on top of the classical chase algorithm. Although the results are developed under a theoretically motivated set semantics, almost all of them carry over to the official, bag-based semantics and are therefore of immediate practical relevance.
