Change-Aware Scheduling for Effectively Updating Linked Open Data Caches

The linked open data (LOD) cloud is a global information space containing a wealth of structured facts that are useful in a wide range of usage scenarios. The LOD cloud handles a large number of requests from applications that consume the data. However, the performance of retrieving data from LOD repositories is one of the major challenges. To overcome this challenge, we argue that it is advantageous to maintain a local cache for efficient querying and processing. Because the LOD cloud evolves continuously, local copies become outdated. To make the best use of available resources, improved scheduling is required to maintain the freshness of the local data cache. In this paper, we propose an approach to efficiently capture the changes and update the cache. The proposed approach, called application-aware change prioritization (AACP), consists of a change metric that quantifies the changes in LOD and a weight function that assigns importance to recent changes. We also propose a mechanism for update policies, called preference-aware source update (PASU), which incorporates the previous estimation of changes and establishes when the local data cache needs to be updated. In the experimental evaluation, several state-of-the-art strategies are compared against the proposed approach. The performance of each policy is measured by computing the precision and recall between the local data cache, updated using the policy under consideration, and the data source, which serves as the ground truth. Both single-update and iterative-update cases are evaluated in this study. The proposed approach is reported to outperform all the other policies, achieving an F1-score of 88% and an effectivity of 93.5%.
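As a rough illustration of the two ideas described above, the following sketch shows how precision, recall, and F1 could be computed between a cached set of triples and the source, and how a history of change values could be weighted so that recent changes count more. The abstract does not give the actual AACP change metric or weight function, so the exponential decay, the `decay` parameter, and the function names `cache_quality` and `weighted_change_score` are assumptions made for this example only.

```python
from typing import Iterable, Set, Tuple

# A triple is modelled as (subject, predicate, object) strings.
Triple = Tuple[str, str, str]


def cache_quality(cache: Set[Triple], source: Set[Triple]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of the cached triples against the source.

    The source is treated as the ground truth, as in the evaluation
    described in the abstract.
    """
    correct = len(cache & source)  # triples present in both cache and source
    precision = correct / len(cache) if cache else 0.0
    recall = correct / len(source) if source else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_change_score(change_history: Iterable[float], decay: float = 0.5) -> float:
    """Weighted average of per-snapshot change values, ordered oldest first.

    ASSUMPTION: exponential decay stands in for the paper's weight
    function; the most recent value gets weight 1, the one before it
    gets `decay`, then `decay ** 2`, and so on.
    """
    values = list(change_history)
    n = len(values)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total if total else 0.0


if __name__ == "__main__":
    source = {("s1", "p", "a"), ("s1", "p", "b"), ("s2", "p", "d"), ("s2", "p", "e")}
    cache = {("s1", "p", "a"), ("s1", "p", "b"), ("s2", "p", "c")}
    print(cache_quality(cache, source))
    # A source that changed recently scores higher than one that changed long ago.
    print(weighted_change_score([0.0, 1.0]), weighted_change_score([1.0, 0.0]))
```

In this sketch, a scheduler could rank sources by `weighted_change_score` and refresh the highest-scoring ones first, then use `cache_quality` to check how close the refreshed cache is to the source. This is one plausible reading of the abstract, not the authors' implementation.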
