Information-theoretic Analysis of Entity Dynamics on the Linked Open Data Cloud

The Linked Open Data (LOD) cloud is expanding continuously. Entities appear, change, and disappear over time. However, relatively little is known about the dynamics of the entities, i.e., the characteristics of their temporal evolution. In this paper, we employ clustering techniques over the dynamics of entities to determine common temporal patterns. We define an entity as an RDF resource together with its attached RDF types and properties. The quality of the clusterings is evaluated using entity features such as the entities' properties, RDF types, and pay-level domain. In addition, we investigate to what extent entities that share a feature value change together over time. As a dataset, we use weekly LOD snapshots over a period of more than three years provided by the Dynamic Linked Data Observatory. Insights into the dynamics of entities on the LOD cloud have strong practical implications for any application requiring fresh caches of LOD. Applications range from determining crawling strategies for LOD and caching SPARQL queries to programming against LOD and recommending LOD vocabularies for reuse.
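
To make the notion of entity dynamics concrete, the following is a minimal sketch of how one entity's weekly snapshots could be turned into a change series that a clustering method might then consume. The abstract does not state how change between snapshots is measured; the Jaccard distance used here, the function names, and the sample entity are all assumptions made for illustration only.

```python
from typing import FrozenSet, List, Sequence, Tuple

# An entity snapshot is modelled (as an assumption) as the set of
# (property, value) pairs attached to one RDF resource in one week.
Snapshot = FrozenSet[Tuple[str, str]]


def jaccard_distance(a: Snapshot, b: Snapshot) -> float:
    """Return 1 - |a ∩ b| / |a ∪ b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dynamics_series(snapshots: Sequence[Snapshot]) -> List[float]:
    """Amount of change between each pair of consecutive snapshots."""
    return [jaccard_distance(prev, curr)
            for prev, curr in zip(snapshots, snapshots[1:])]


# Hypothetical entity observed over four weeks: unchanged, then one
# property value edited, then the entity disappears (empty snapshot).
weekly: List[Snapshot] = [
    frozenset({("rdf:type", "foaf:Person"), ("foaf:name", "Alice")}),
    frozenset({("rdf:type", "foaf:Person"), ("foaf:name", "Alice")}),
    frozenset({("rdf:type", "foaf:Person"), ("foaf:name", "Alice B.")}),
    frozenset(),
]
series = dynamics_series(weekly)
```

Series of this kind, one per entity, could serve as input to a time-series clustering step; which distance measure and clustering algorithm are appropriate is left open here.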
