Quality assessment for Linked Data: A Survey

The development and standardization of semantic web technologies have resulted in an unprecedented volume of data being published on the Web as Linked Data (LD). However, we observe widely varying data quality, ranging from extensively curated datasets to crowdsourced and extracted data of relatively low quality. In this article, we present the results of a systematic review of approaches for assessing the quality of LD. We gather existing approaches and analyze them qualitatively. In particular, we unify and formalize the terminology commonly used across papers on data quality and provide a comprehensive list of 18 quality dimensions and 69 metrics. Additionally, we qualitatively analyze the 30 core approaches and 12 tools using a set of attributes. The aim of this article is to provide researchers and data curators with a comprehensive understanding of existing work, thereby encouraging further experimentation and the development of new approaches focused on data quality, specifically for LD.
