SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data

We present SchemEX, an approach and tool for stream-based indexing and schema extraction of Linked Open Data (LOD) at web scale. The schema index provided by SchemEX can be used to locate distributed data sources in the LOD cloud. It serves typical LOD information needs, such as finding sources that contain instances of one specific data type, instances of a given set of data types (so-called type clusters), or instances in type clusters that are connected by one or more common properties (so-called equivalence classes). The entire process of extracting the schema from triples and constructing the index is designed to have linear runtime complexity. Thus, the schema index can be computed on the fly while the triples are crawled and provided as a stream by a linked data spider. To demonstrate the web scalability of our approach, we have computed a SchemEX index over the Billion Triples Challenge (BTC) 2011 dataset, which consists of 2,170 million triples. In addition, we have computed the SchemEX index on a smaller dataset of 11 million triples, which we use for a detailed qualitative analysis. At a window size of 100K triples observed in the stream, SchemEX locates relevant data sources with a recall between 71% and 98% and a precision between 74% and 100%, depending on the complexity of the query, i.e., whether one searches for specific data types, type clusters, or equivalence classes.
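The indexing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the SchemEX implementation. The function names `build_schema_index` and `find_sources`, the quad layout `(subject, predicate, object, source)`, and the example data are assumptions made for illustration. Two simplifications are also assumed: the window is counted in cached instances rather than in triples, and an equivalence class is approximated by an instance's type cluster together with its set of properties.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_schema_index(quads, window_size=100_000):
    """Single pass over a stream of (subject, predicate, object, source) quads.

    Returns a dict mapping (type cluster, property set) to the set of sources
    that contain a matching instance. Illustrative simplification only.
    """
    cache = OrderedDict()      # subject -> partial description, in arrival order
    index = defaultdict(set)   # (frozenset of types, frozenset of props) -> sources

    def flush(entry):
        key = (frozenset(entry["types"]), frozenset(entry["props"]))
        index[key] |= entry["sources"]

    for subject, predicate, obj, source in quads:
        entry = cache.get(subject)
        if entry is None:
            # Bounded window: evict the oldest instance before adding a new one.
            if len(cache) >= window_size:
                _, oldest = cache.popitem(last=False)
                flush(oldest)
            entry = cache[subject] = {"types": set(), "props": set(), "sources": set()}
        if predicate == RDF_TYPE:
            entry["types"].add(obj)
        else:
            entry["props"].add(predicate)
        entry["sources"].add(source)

    # End of stream: write out everything still held in the window.
    for entry in cache.values():
        flush(entry)
    return dict(index)


def find_sources(index, types):
    """Sources holding instances whose type cluster contains all given types."""
    wanted = frozenset(types)
    found = set()
    for (type_cluster, _props), sources in index.items():
        if wanted <= type_cluster:
            found |= sources
    return found


# Hypothetical example stream with three instances from three sources.
quads = [
    ("ex:a", RDF_TYPE, "foaf:Person", "src1"),
    ("ex:a", "foaf:knows", "ex:b", "src1"),
    ("ex:b", RDF_TYPE, "foaf:Person", "src2"),
    ("ex:b", "foaf:knows", "ex:a", "src2"),
    ("ex:c", RDF_TYPE, "foaf:Document", "src3"),
]
index = build_schema_index(quads, window_size=2)
person_sources = find_sources(index, {"foaf:Person"})
```

Each quad triggers only a constant number of hash operations, so a pass of this shape is linear in the number of triples. The bounded window also suggests why recall can fall below 100%: if triples about one instance arrive further apart than the window allows, the instance is flushed with an incomplete description.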
