Indexing Data on the Web: A Comparison of Schema-level Indices for Data Search - Extended Technical Report

Indexing the Web of Data offers many opportunities, in particular for finding and exploring data sources. A major design decision when indexing the Web of Data is the choice of index model, i.e., how the data is indexed and summarized. Various index models have been developed, each for a specific task. Because each index model has been designed, implemented, and evaluated independently, it remains difficult to judge whether an approach generalizes well to another task, set of queries, or dataset. In this work, we empirically evaluate six representative index models with unique feature combinations. Among them is a new index model that incorporates inferencing over RDFS and owl:sameAs. For the first time, we implement all index models in a single, stream-based framework. We evaluate variations of the index models that consider sub-graphs of size 0, 1, and 2 hops on two large, real-world datasets. We assess the quality of the indices in terms of compression ratio, summarization ratio, and F1-score, where the F1-score denotes the approximation quality of the stream-based index computation. The experiments reveal large variations in compression ratio, summarization ratio, and approximation quality across index models, queries, and datasets. However, we observe meaningful correlations in the results that help to determine the right index model for a given task, type of query, and dataset.
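To make the notions of a schema-level index and of approximation quality more concrete, the following is a minimal Python sketch and not the framework evaluated in this work. It assumes a toy index model that groups subjects by their set of rdf:type values (a 0-hop summary), and it imitates a stream-based computation by processing triples in fixed-size chunks. The function names, the chunking strategy, and the pair-based F1 computation are all illustrative assumptions; the exact definitions used in the evaluation may differ.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # abbreviated predicate name, used only in this toy example


def build_type_index(triples):
    """Group subjects by their full set of rdf:type objects (assumed 0-hop model)."""
    types = defaultdict(set)
    subjects = set()
    for s, p, o in triples:
        subjects.add(s)
        if p == RDF_TYPE:
            types[s].add(o)
    index = defaultdict(set)
    for s in subjects:
        index[frozenset(types[s])].add(s)
    return dict(index)


def build_windowed_index(triples, window):
    """Crude stand-in for stream-based indexing (an assumption of this sketch):
    each chunk of `window` triples is indexed in isolation, so type information
    for a subject that is spread over several chunks is never combined."""
    merged = defaultdict(set)
    for start in range(0, len(triples), window):
        chunk_index = build_type_index(triples[start:start + window])
        for key, members in chunk_index.items():
            merged[key] |= members
    return dict(merged)


def f1_score(gold, approx):
    """F1 over (schema element, instance) pairs of a gold and an approximate index."""
    gold_pairs = {(k, s) for k, members in gold.items() for s in members}
    approx_pairs = {(k, s) for k, members in approx.items() for s in members}
    true_positives = len(gold_pairs & approx_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(approx_pairs)
    recall = true_positives / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


triples = [
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:bob", RDF_TYPE, "ex:Person"),
    ("ex:carol", RDF_TYPE, "ex:Person"),
    ("ex:doc1", RDF_TYPE, "ex:Document"),
    ("ex:carol", RDF_TYPE, "ex:Student"),  # arrives late in the stream
]

gold = build_type_index(triples)                   # index over the complete graph
approx = build_windowed_index(triples, window=3)   # index with limited lookahead
score = f1_score(gold, approx)
print(len(gold), "schema elements;", "F1 =", round(score, 3))
```

In this toy data, the second type statement for ex:carol falls into a later chunk than the first, so the chunked index assigns ex:carol to two incomplete schema elements instead of the correct combined one. This is the kind of error that an F1-score against an index computed over the complete graph can expose.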
