Towards an Efficient RDF Dataset Slicing

Over the last few years, a considerable amount of structured data has been published on the Web as Linked Open Data (LOD). Despite recent advances, consuming and using Linked Open Data within an organization remains a substantial challenge. Many LOD datasets are quite large and, despite progress in Resource Description Framework (RDF) data management, loading and querying them within a triple store is extremely time-consuming and resource-demanding. To overcome this consumption obstacle, we propose a process inspired by the classical Extract-Transform-Load (ETL) paradigm. In this article, we focus on the selection and extraction steps of this process. We devise a fragment of the SPARQL Protocol and RDF Query Language (SPARQL), dubbed SliceSPARQL, which enables the selection of well-defined slices of datasets that fulfill typical information needs. SliceSPARQL supports graph patterns in which each connected subgraph pattern involves at most one variable or Internationalized Resource Identifier (IRI) in its join conditions. This restriction guarantees that the query can be processed efficiently against a sequential dataset dump stream. Furthermore, we evaluate our slicing approach with three different optimization strategies. The results show that dataset slices can be generated an order of magnitude faster than with the conventional approach of loading the whole dataset into a triple store.
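To illustrate why a single join variable permits processing against a sequential dump stream, the following is a minimal sketch and not the authors' implementation. It assumes an N-Triples dump, a simplified line parser, and a star-shaped pattern of the form `?s rdf:type <Class> . ?s ?p ?o`, where `?s` is the only join variable. The names `parse_line`, `slice_dump`, and `open_dump` are hypothetical. A first pass collects the subjects that match the selecting triple pattern, and a second pass emits every triple of those subjects, so the dump is only ever read sequentially and never loaded into a triple store.

```python
import re

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Simplified N-Triples line: subject, predicate, object, terminating dot.
# Assumption: terms are well-formed; this is not a full N-Triples parser.
TRIPLE_RE = re.compile(r"^\s*(\S+)\s+(\S+)\s+(.*?)\s*\.\s*$")


def parse_line(line):
    """Return (subject, predicate, object) or None for blanks, comments, junk."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    match = TRIPLE_RE.match(line)
    return match.groups() if match else None


def slice_dump(open_dump, selector_predicate, selector_object):
    """Yield all triples whose subject matches the selecting pattern.

    open_dump is a callable returning a fresh iterator over the dump's lines,
    so the dump can be streamed twice without being held in memory.
    """
    # Pass 1: collect candidate bindings for the single join variable ?s.
    subjects = set()
    for line in open_dump():
        triple = parse_line(line)
        if triple and triple[1] == selector_predicate and triple[2] == selector_object:
            subjects.add(triple[0])

    # Pass 2: stream again and emit every triple of a candidate subject.
    for line in open_dump():
        triple = parse_line(line)
        if triple and triple[0] in subjects:
            yield triple
```

Because only the set of candidate subjects is kept in memory, the memory footprint of this sketch grows with the size of the slice rather than with the size of the dump. The second pass is needed because triples of a subject may appear in the dump before the triple that selects it.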
