Using Semantic Web Resources for Data Quality Management

Data quality is a critical factor for all kinds of decision-making and transaction processing. Although data quality has been studied extensively over the past two decades, the topic has not yet received sufficient attention from the Semantic Web community. In this paper, we discuss (1) the data quality issues raised by the growing amount of data available on the Semantic Web, (2) how data quality problems can be handled within the Semantic Web technology framework, namely by using SPARQL on RDF representations, and (3) how Semantic Web reference data, e.g. from DBpedia, can be used to spot incorrect literal values and functional dependency violations. We show how this approach can be used for data quality management both of public Semantic Web data and of data stored in relational databases in closed settings. As part of our work, we developed generic SPARQL queries to identify (1) missing datatype properties or literal values, (2) illegal values, and (3) functional dependency violations. We argue that using Semantic Web datasets substantially reduces the effort for data quality management. As a use case, we employ GeoNames, a publicly available Semantic Web resource for geographical data, as a trusted reference for managing the quality of other data sources.
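
To make the three problem classes concrete, the following is a minimal sketch in plain Python rather than SPARQL, so it is an analogue of the idea and not the queries developed in the paper. It assumes data is available as (subject, property, value) triples. The `ex:` property names, the sample records, the reference set `REFERENCE_CITIES`, and all function names are hypothetical and chosen only for illustration. In the setting described above, the reference values would come from a trusted source such as GeoNames.

```python
from collections import defaultdict

# Hypothetical sample data: (subject, property, value) triples mimicking RDF statements.
TRIPLES = [
    ("ex:c1", "ex:city", "Munich"),
    ("ex:c1", "ex:country", "DE"),
    ("ex:c2", "ex:city", "Munich"),
    ("ex:c2", "ex:country", "FR"),   # conflicts with ex:c1 (same city, other country)
    ("ex:c3", "ex:city", "Mnchen"),  # misspelled literal
    ("ex:c3", "ex:country", "DE"),
    ("ex:c4", "ex:country", "DE"),   # no ex:city at all
]

# Hypothetical trusted reference values (assumed to be loaded from a reference dataset).
REFERENCE_CITIES = {"Munich", "Paris", "Berlin"}


def values_by_subject(triples, prop):
    """Map each subject to the set of values it has for the given property."""
    result = defaultdict(set)
    for subject, predicate, value in triples:
        if predicate == prop:
            result[subject].add(value)
    return result


def missing_values(triples, prop):
    """Subjects that lack the property or carry only empty literals for it."""
    subjects = {subject for subject, _, _ in triples}
    values = values_by_subject(triples, prop)
    return sorted(
        s for s in subjects if not any(v.strip() for v in values.get(s, ()))
    )


def illegal_values(triples, prop, legal):
    """(subject, value) pairs whose value is not in the trusted reference set."""
    return sorted(
        (s, o) for s, p, o in triples if p == prop and o not in legal
    )


def fd_violations(triples, determinant, dependent):
    """Determinant values that map to more than one dependent value."""
    det = values_by_subject(triples, determinant)
    dep = values_by_subject(triples, dependent)
    seen = defaultdict(set)
    for subject, det_values in det.items():
        for d in det_values:
            seen[d] |= dep.get(subject, set())
    return {d: sorted(v) for d, v in seen.items() if len(v) > 1}


if __name__ == "__main__":
    print(missing_values(TRIPLES, "ex:city"))
    print(illegal_values(TRIPLES, "ex:city", REFERENCE_CITIES))
    print(fd_violations(TRIPLES, "ex:city", "ex:country"))
```

In this sketch the functional dependency check only compares records within the dataset itself. A variant closer to the approach described above would compare each (city, country) pair against the pairs found in the reference data.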
