Entity Disambiguation with Textual and Connection Information

Abstract Entity disambiguation is the task of resolving which underlying real-world entity each record refers to when several records share the same surface form. It arises in information integration, document retrieval, web search, and many other applications. Because entities in most real-world data carry both textual information and inter-object relationships, we propose an unsupervised iterative similarity propagation algorithm to disambiguate them. We first select entity pairs with the same surface form as probable matching candidates, and construct a connection graph that takes these candidate pairs as nodes and builds edges from the inter-object relationships. The more similar the textual information of the two records in a candidate pair, the more likely it is that the two records correspond to the same real-world entity, so we use the textual similarity score as the initial value for our iterative method. The similarity of each entity pair is then propagated over the connection graph. When the iteration terminates, we identify the pairs whose final similarity scores exceed a given threshold as true matches. We apply the method to disambiguate authors in publication records. Experimental results on the real DBLP digital library data set demonstrate its effectiveness.
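The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the textual similarity measure, the edge rule, or the propagation formula, so everything concrete here is an assumption made for illustration: Jaccard token overlap as the textual score, an edge between two candidate pairs when their records are related crosswise (for example through co-authorship), an update that blends each pair's initial score with the mean score of its neighbors, and the hypothetical names `candidate_pairs`, `build_graph`, `propagate`, `alpha`, and `threshold`. The records are toy data, not DBLP entries.

```python
from itertools import combinations

def jaccard(a, b):
    """Token-overlap similarity in [0, 1] (assumed textual measure)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def candidate_pairs(records):
    """Pairs of record ids that share the same surface form (name)."""
    return [(i, j) for i, j in combinations(sorted(records), 2)
            if records[i][0] == records[j][0]]

def build_graph(pairs, related):
    """Link pairs (a, b) and (c, d) when a~c and b~d, or a~d and b~c."""
    def rel(x, y):
        return frozenset((x, y)) in related
    graph = {p: set() for p in pairs}
    for p, q in combinations(pairs, 2):
        (a, b), (c, d) = p, q
        if (rel(a, c) and rel(b, d)) or (rel(a, d) and rel(b, c)):
            graph[p].add(q)
            graph[q].add(p)
    return graph

def propagate(initial, graph, alpha=0.5, max_iter=50, tol=1e-6):
    """Assumed update: blend the initial score with the neighbor mean."""
    scores = dict(initial)
    for _ in range(max_iter):
        new = {}
        for p, nbrs in graph.items():
            if nbrs:
                mean = sum(scores[q] for q in nbrs) / len(nbrs)
                new[p] = (1 - alpha) * initial[p] + alpha * mean
            else:
                new[p] = initial[p]  # isolated pair keeps its textual score
        delta = max((abs(new[p] - scores[p]) for p in new), default=0.0)
        scores = new
        if delta < tol:
            break
    return scores

# Toy records: id -> (name, textual attributes). Purely hypothetical.
records = {
    1: ("A. Smith", "graph mining frequent patterns"),
    2: ("A. Smith", "mining large graph databases"),
    3: ("A. Smith", "protein folding simulation"),
    4: ("B. Lee", "frequent pattern mining"),
    5: ("B. Lee", "frequent pattern mining algorithms"),
}
# Assumed inter-object relationships: records 1 and 4 are co-authors,
# as are records 2 and 5.
related = {frozenset((1, 4)), frozenset((2, 5))}

pairs = candidate_pairs(records)
initial = {(i, j): jaccard(records[i][1], records[j][1]) for i, j in pairs}
scores = propagate(initial, build_graph(pairs, related))

threshold = 0.4  # assumed cut-off
matches = [p for p in pairs if scores[p] > threshold]
print(sorted(matches))  # → [(1, 2), (4, 5)]
```

In this toy run the pair (1, 2) starts below the threshold on text alone and is lifted above it by its connection to the strongly matching pair (4, 5), which is the effect the propagation step is meant to capture. Pair (1, 3) has no textual overlap and no supporting connection, so it stays unmatched.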