On Lightweight Data Summaries for Optimised Query Processing over Linked Data

Typical approaches for querying structured Web data collect (crawl) and pre-process (index) large amounts of data in a central data repository before allowing for query answering. This time-consuming pre-processing phase, however, leverages the benefits of Linked Data (where structured data is accessible live and up to date at distributed Web resources that may change constantly) only to a limited degree, as query results can never be up to date. An ideal query answering system for Linked Data should return current answers in a reasonable amount of time, even on corpora as large as the Web. Query processors that evaluate queries directly on the live sources require knowledge of the contents of those data sources. In this paper, we develop and evaluate an approximate index structure that summarises the graph-structured content of sources adhering to Linked Data principles, provide an algorithm that exploits this source summary to answer conjunctive queries over Linked Data on the Web, and evaluate the system using synthetically generated queries. The experimental results show that our lightweight index structure enables complete and up-to-date query results over Linked Data, while keeping the querying overhead low and providing a satisfactory source ranking “for free”.
