On the combination of domain-specific heuristics for author name disambiguation: the nearest cluster method

Author name disambiguation has been one of the hardest problems faced by digital libraries since their early days. Historically, supervised solutions have empirically outperformed those based on heuristics, but at the cost of relying on manually labeled training sets for the learning process. Moreover, most supervised solutions simply apply a generic machine learning technique and do not exploit specific knowledge about the problem. In this article, we approach the problem from the opposite direction. Instead of extending an existing supervised solution, we propose a set of carefully designed heuristics and similarity functions, and apply supervision only to optimize their parameters for each particular dataset. As our experiments show, the result is a very effective, efficient and practical author name disambiguation method that can be used in many different scenarios. In fact, we show that our method can beat state-of-the-art supervised methods in terms of effectiveness in many situations while being orders of magnitude faster. It can also run without any training information, using only default parameters, and still be very competitive with these supervised methods (beating several of them) and better than most existing unsupervised author name disambiguation solutions.
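To make the general idea concrete, the following is a minimal sketch of a nearest-cluster assignment driven by a weighted similarity function. It is not the method evaluated in this article: the `Citation` fields, the Jaccard-based `similarity` function, and the default weights and threshold in `Params` are all illustrative assumptions. The sketch only shows how heuristic similarity functions with a few tunable parameters can group citations without training data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    # Hypothetical record layout, chosen for illustration only.
    coauthors: frozenset
    title_terms: frozenset
    venue: str


@dataclass(frozen=True)
class Params:
    # Assumed default values; with labeled data these could be tuned per dataset.
    w_coauthor: float = 0.5
    w_title: float = 0.3
    w_venue: float = 0.2
    threshold: float = 0.2


def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(citation, cluster, p):
    """Weighted similarity between a citation and the pooled attributes of a cluster."""
    coauthors = frozenset().union(*(m.coauthors for m in cluster))
    title_terms = frozenset().union(*(m.title_terms for m in cluster))
    venues = {m.venue for m in cluster}
    return (
        p.w_coauthor * jaccard(citation.coauthors, coauthors)
        + p.w_title * jaccard(citation.title_terms, title_terms)
        + p.w_venue * (1.0 if citation.venue in venues else 0.0)
    )


def nearest_cluster(citations, p=None):
    """Assign each citation to its most similar cluster, or open a new one."""
    p = p or Params()
    clusters = []
    for c in citations:
        best, best_sim = None, p.threshold
        for cluster in clusters:
            s = similarity(c, cluster, p)
            if s >= best_sim:
                best, best_sim = cluster, s
        if best is None:
            clusters.append([c])  # no cluster is similar enough
        else:
            best.append(c)
    return clusters


# Toy input: the first two citations share a coauthor, a title term and a venue.
citations = [
    Citation(frozenset({"b. lee", "c. kim"}), frozenset({"name", "disambiguation"}), "JCDL"),
    Citation(frozenset({"b. lee"}), frozenset({"author", "disambiguation"}), "JCDL"),
    Citation(frozenset({"x. wu"}), frozenset({"protein", "folding"}), "Nature"),
]
clusters = nearest_cluster(citations)
print([len(c) for c in clusters])
```

In this sketch, supervision would enter only through the choice of `Params`, for example by searching over weights and thresholds on a labeled sample, while the clustering procedure itself stays unchanged.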
