Structural Properties as Proxy for Semantic Relevance in RDF Graph Sampling

The Linked Data cloud has grown to become the largest knowledge base ever constructed, and its size is now a major bottleneck for many applications. To facilitate access to this structured information, this paper proposes an automatic sampling method that aims to maximize answer coverage for applications that use SPARQL querying. The approach is novel: no similar RDF sampling approach exists, and the idea of creating a sample aimed at maximizing SPARQL answer coverage is unique. We show empirically that the relevance of triples for sampling (a semantic notion) is influenced by the topology of the graph (a purely structural notion), and that it can be determined without prior knowledge of the queries. Experiments show that topology-based sampling methods achieve significantly higher recall than random and naive baseline approaches (for example, up to 90% for Open-BioMed at a sample size of 6%).
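The idea of ranking triples by graph structure and then measuring how many query answers survive in the sample can be illustrated with a minimal sketch. The abstract does not name the specific topological measures used, so everything below is an assumption made for illustration: the in-degree of a triple's object stands in for the ranking, single triple patterns stand in for SPARQL queries, and the function names `sample_by_in_degree`, `match_pattern`, and `recall` are hypothetical.

```python
from collections import Counter


def sample_by_in_degree(triples, fraction):
    """Keep the top `fraction` of triples, ranked by the in-degree of their object.

    In-degree is an assumed stand-in for a topological ranking; it is not
    taken from the paper.
    """
    in_degree = Counter(o for _, _, o in triples)
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(triples, key=lambda t: in_degree[t[2]], reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return set(ranked[:keep])


def match_pattern(triples, pattern):
    """Return the triples matching one triple pattern; None acts as a variable."""
    return {
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    }


def recall(full, sample, patterns):
    """Fraction of the answers over the full graph that the sample still returns."""
    expected = set().union(*(match_pattern(full, p) for p in patterns))
    found = set().union(*(match_pattern(sample, p) for p in patterns))
    return len(found) / len(expected) if expected else 1.0


# Toy graph (invented data): three triples point at a well-connected node.
triples = [
    ("a", "knows", "hub"),
    ("b", "knows", "hub"),
    ("c", "knows", "hub"),
    ("d", "knows", "e"),
    ("hub", "type", "Person"),
    ("e", "type", "Robot"),
]
sample = sample_by_in_degree(triples, 0.5)
hub_recall = recall(triples, sample, [(None, "knows", "hub")])
type_recall = recall(triples, sample, [(None, "type", None)])
```

On this toy graph the sample keeps the three triples that point at `hub`, so a query about `hub` is answered in full while a query about types is not answered at all. This shows in miniature why the recall of a structural sample depends on how well the ranking matches what queries actually ask for.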
