A Preliminary Investigation Towards Improving Linked Data Quality Using Distance-Based Outlier Detection

With more and more data being published on the Web as Linked Data, the quality of Web data is becoming increasingly important. While considerable work has been done on quality assessment of Linked Data, only a few works have addressed quality improvement. In this article, we present a preliminary approach for identifying potentially incorrect RDF statements using distance-based outlier detection. Our method follows a three-stage approach, which automates the whole process of finding potentially incorrect statements for a certain property. Our preliminary evaluation shows that high precision is maintained across different settings.
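To make the notion of a distance-based outlier concrete, the following minimal sketch flags an item as an outlier when at least a fraction p of the other items lie farther than a distance d from it. This is the general distance-based definition, not the three-stage method of the article, whose details are not given here. The names `db_outliers` and `jaccard_distance`, the representation of each object as a set of type labels, and the thresholds are all assumptions made for illustration.

```python
from typing import Callable, FrozenSet, List, Sequence, TypeVar

T = TypeVar("T")


def db_outliers(
    items: Sequence[T],
    distance: Callable[[T, T], float],
    p: float,
    d: float,
) -> List[T]:
    """Return items for which at least a fraction p of the other
    items lie at a distance greater than d (naive O(n^2) scan)."""
    outliers = []
    for i, a in enumerate(items):
        others = [b for j, b in enumerate(items) if j != i]
        if not others:
            continue  # a single item cannot be judged against anything
        far = sum(1 for b in others if distance(a, b) > d)
        if far / len(others) >= p:
            outliers.append(a)
    return outliers


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Hypothetical distance: 1 minus the Jaccard similarity of two label sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical objects of one property, each described by its type labels.
objects = [
    frozenset({"Person", "Writer"}),
    frozenset({"Person", "Writer", "Poet"}),
    frozenset({"Person", "Writer"}),
    frozenset({"City", "Place"}),
]

# Assumed thresholds: p = 0.9, d = 0.7.
flagged = db_outliers(objects, jaccard_distance, p=0.9, d=0.7)
print(flagged)
```

With these illustrative inputs, only the object typed as a city is far from every other object, so it is the only one flagged. In practice, the choice of distance measure and of p and d would determine which statements are reported as potentially incorrect.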
