Robust hybrid name disambiguation framework for large databases

In many databases, science bibliography database for example, name attribute is the most commonly chosen identifier to identify entities. However, names are often ambiguous and not always unique which cause problems in many fields. Name disambiguation is a non-trivial task in data management that aims to properly distinguish different entities which share the same name, particularly for large databases like digital libraries, as only limited information can be used to identify authors’ name. In digital libraries, ambiguous author names occur due to the existence of multiple authors with the same name or different name variations for the same person. Also known as name disambiguation, most of the previous works to solve this issue often employ hierarchical clustering approaches based on information inside the citation records, e.g. co-authors and publication titles. In this paper, we focus on proposing a robust hybrid name disambiguation framework that is not only applicable for digital libraries but also can be easily extended to other application based on different data sources. We propose a web pages genre identification component to identify the genre of a web page, e.g. whether the page is a personal homepage. In addition, we propose a re-clustering model based on multidimensional scaling that can further improve the performance of name disambiguation. We evaluated our approach on known corpora, and the favorable experiment results indicated that our proposed framework is feasible.

[1]  Byung-Won On,et al.  Effective and scalable solutions for mixed and split citation problems in digital libraries , 2005, IQIS '05.

[2]  john maccoll,et al.  ACM / IEEE Joint Conference on Digital Libraries , 2001 .

[3]  S. Hyakin,et al.  Neural Networks: A Comprehensive Foundation , 1994 .

[4]  Shazia Wasim Sadiq,et al.  Sampling dirty data for matching attributes , 2010, SIGMOD Conference.

[5]  Dongwon Lee,et al.  Search engine driven author disambiguation , 2006, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06).

[6]  James C. Bezdek,et al.  Decision templates for multiple classifier fusion: an experimental comparison , 2001, Pattern Recognit..

[7]  Hong Cheng,et al.  Graph Clustering Based on Structural/Attribute Similarities , 2009, Proc. VLDB Endow..

[8]  Alistair Kennedy,et al.  Automatic Identification of Home Pages on the Web , 2005, Proceedings of the 38th Annual Hawaii International Conference on System Sciences.

[9]  Amit P. Sheth,et al.  Semantic analytics on social networks: experiences in addressing the problem of conflict of interest detection , 2006, WWW '06.

[10]  C. Lee Giles,et al.  Two supervised learning approaches for name disambiguation in author citations , 2004, Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries, 2004..

[11]  Yang Song,et al.  Efficient topic-based unsupervised name disambiguation , 2007, JCDL '07.

[12]  Won-Kyung Sung,et al.  On co-authorship for author disambiguation , 2009, Inf. Process. Manag..

[13]  Pedro M. Domingos,et al.  On the Optimality of the Simple Bayesian Classifier under Zero-One Loss , 1997, Machine Learning.

[14]  Dmitri V. Kalashnikov,et al.  Domain-independent data cleaning via analysis of entity-relationship graph , 2006, TODS.

[15]  Hui Han,et al.  Name disambiguation in author citations using a K-way spectral clustering method , 2005, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05).

[16]  Patrick J. F. Groenen,et al.  Modern Multidimensional Scaling: Theory and Applications , 2003 .

[17]  Jiang Wu,et al.  Author name disambiguation in scientific collaboration and mobility cases , 2013, Scientometrics.

[18]  Alberto J. Cañas,et al.  Using WordNet for Word Sense Disambiguation to Support Concept Map Construction , 2003, SPIRE.

[19]  Robin Sibson,et al.  SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method , 1973, Comput. J..

[20]  Philip S. Yu,et al.  Object Distinction: Distinguishing Objects with Identical Names , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[21]  Michael C. Fairhurst,et al.  Classifier Ensemble Generation for the Majority Vote Rule , 2008, CIARP.

[22]  Xiaofang Zhou,et al.  Enhance Web Pages Genre Identification Using Neighboring Pages , 2011, WISE.

[23]  Xiaofang Zhou,et al.  Efficient web pages identification for entity resolution , 2010, WWW '10.

[24]  C. Lee Giles,et al.  Efficient Name Disambiguation for Large-Scale Databases , 2006, PKDD.

[25]  Jan-Ming Ho,et al.  Author Name Disambiguation for Citations Using Topic and Web Correlation , 2008, ECDL.

[26]  Xiaofang Zhou,et al.  A Term-Based Driven Clustering Approach for Name Disambiguation , 2009, APWeb/WAIM.

[27]  Berthier A. Ribeiro-Neto,et al.  Using web information for author name disambiguation , 2009, JCDL '09.

[28]  魏屹东,et al.  Scientometrics , 2018, Encyclopedia of Big Data.