Not Quite the Same: Identity Constraints for the Web of Linked Data

Linked Data is based on the idea that information from different sources can be flexibly connected to enable novel applications that individual datasets do not support on their own. This hinges on the existence of links between datasets that would otherwise be isolated. The most notable form of link, sameAs, is intended to express that two identifiers are equivalent in all respects. Unfortunately, many existing sameAs links do not reflect such genuine identity. This study provides a novel method to analyse this phenomenon, based on a thorough theoretical analysis, as well as a novel graph-based method to resolve such issues to some extent. Our experiments on a representative Web-scale set of sameAs links from the Web of Data show that our method can identify and remove hundreds of thousands of constraint violations.
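To make the notion of a constraint violation concrete, the following is a minimal sketch and not the method of the paper. It assumes that sameAs links form an undirected graph and that a constraint is a set of identifiers known to be pairwise distinct (for instance, two identifiers from the same dataset). A violation then arises whenever two such identifiers end up in the same connected component. The function names and the identifiers in the example are hypothetical, and the sketch only detects violations; it does not decide which links to remove.

```python
from collections import defaultdict
from itertools import combinations


def find_components(links):
    """Group identifiers into connected components of the sameAs graph."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def find_violations(links, distinct_sets):
    """Return pairs of identifiers that are assumed distinct (hypothetical
    constraint sets) but are connected through a chain of sameAs links."""
    violations = []
    for component in find_components(links):
        for distinct in distinct_sets:
            clash = sorted(component & distinct)
            violations.extend(combinations(clash, 2))
    return violations


# Hypothetical identifiers: dataset "a" has two different entities that are
# both linked to the same identifier in dataset "b".
links = [("a:Paris", "b:Paris"), ("b:Paris", "a:Paris_Texas")]
distinct_sets = [{"a:Paris", "a:Paris_Texas"}]
print(find_violations(links, distinct_sets))
```

In this example the two identifiers from dataset "a" are connected via "b:Paris", so the pair is reported as a violation. Resolving it would require removing at least one of the two links, which is the part a graph-based cut method would have to address.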
