Vocab.at - Automatic Linked Data Documentation and Vocabulary Usage Analysis

A growing amount of Linked Data is being published as RDF data dumps, as RDFa embedded in HTML pages, and via SPARQL endpoints. Unfortunately, the available data is often poorly documented and the consistency of the datasets is unknown. Understanding whether a dataset is fit for an intended use can therefore be very time consuming and impede the re-use of the data. When quality is considered as fitness for use, documentation is a key component for assessing data quality. The common practice today is to document the vocabularies that Linked Data uses. However, this approach neglects documenting how the vocabularies are actually used in the datasets. In contrast, this paper presents a novel approach for assessing vocabulary usage in Linked Data. The method generates missing documentation automatically and complements it by analysing the usage of vocabularies in the datasets. The resulting documentation shows the explicit vocabulary usage, which is invaluable when assessing the consistency and usefulness of the data. The method has been evaluated by developing the web service http://vocab.at and applying the analysis to selected datasets on the web.
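To illustrate the kind of vocabulary usage analysis described above, the following minimal sketch counts how often each vocabulary namespace occurs in the properties and class instantiations of an RDF dump. It is an illustrative assumption, not the Vocab.at implementation: the input file name dataset.ttl, the rdflib-based parsing, and the namespace-splitting heuristic are placeholders chosen for the example.

```python
# Minimal sketch (assumption, not the Vocab.at implementation):
# count vocabulary usage in an RDF dump with rdflib.
from collections import Counter

from rdflib import Graph, URIRef
from rdflib.namespace import RDF


def namespace_of(uri: URIRef) -> str:
    """Heuristically split a URI into its vocabulary namespace."""
    u = str(uri)
    for sep in ("#", "/"):
        if sep in u:
            return u.rsplit(sep, 1)[0] + sep
    return u


g = Graph()
g.parse("dataset.ttl", format="turtle")  # placeholder input file

property_usage = Counter()  # namespaces of predicates
class_usage = Counter()     # namespaces of instantiated classes

for s, p, o in g:
    property_usage[namespace_of(p)] += 1
    if p == RDF.type and isinstance(o, URIRef):
        class_usage[namespace_of(o)] += 1

print("Vocabularies used in properties:")
for ns, count in property_usage.most_common():
    print(f"  {ns}\t{count}")

print("Vocabularies used in class instantiations:")
for ns, count in class_usage.most_common():
    print(f"  {ns}\t{count}")
```

A report of this kind, aggregated per dataset, is the sort of explicit vocabulary-usage documentation the abstract refers to.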
