Using Link Features for Entity Clustering in Knowledge Graphs

Knowledge graphs holistically integrate information about entities from multiple sources. A key step in the construction and maintenance of knowledge graphs is the clustering of equivalent entities from different sources. Previous approaches to entity clustering suffer from several problems, such as overlapping clusters or clusters that contain several entities from the same source. We therefore propose a new entity clustering algorithm, CLIP, that can be applied both to create entity clusters and to repair entity clusters determined with another clustering scheme. In contrast to previous approaches, CLIP uses not only the similarity between entities for clustering but also further features of entity links, such as the so-called link strength. To achieve good scalability, we provide a parallel implementation of CLIP based on Apache Flink. Our evaluation on different datasets shows that the new approach can achieve substantially higher cluster quality than previous approaches.
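
The two cluster problems named above, overlapping clusters and several entities from the same source within one cluster, can be illustrated with a small check. The following is only a sketch and not part of CLIP: the function name `find_cluster_violations`, the representation of an entity as a `(source, entity_id)` pair, and the sample source names and identifiers are all assumptions made for illustration.

```python
from collections import Counter


def find_cluster_violations(clusters):
    """Report the two cluster problems described in the abstract.

    clusters: list of sets of (source, entity_id) pairs (hypothetical format).
    Returns (overlapping, source_conflicts):
      overlapping      - entities that occur in more than one cluster
      source_conflicts - (cluster index, sorted sources with >1 entity in it)
    """
    # Count in how many clusters each entity appears.
    membership = Counter(entity for cluster in clusters for entity in set(cluster))
    overlapping = {entity for entity, n in membership.items() if n > 1}

    # Count entities per source inside each cluster.
    source_conflicts = []
    for idx, cluster in enumerate(clusters):
        per_source = Counter(source for source, _ in set(cluster))
        duplicated = sorted(s for s, n in per_source.items() if n > 1)
        if duplicated:
            source_conflicts.append((idx, duplicated))
    return overlapping, source_conflicts


# Made-up example: entity ("src_a", "e1") is in two clusters, and the second
# cluster holds two entities from the same source "src_b".
clusters = [
    {("src_a", "e1"), ("src_b", "e7")},
    {("src_a", "e1"), ("src_b", "e8"), ("src_b", "e9")},
]
overlapping, source_conflicts = find_cluster_violations(clusters)
```

A clustering that is free of both problems yields an empty set and an empty list; a repair step in the sense of the abstract would aim to reach that state.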
