A heuristic approach to author name disambiguation in bibliometrics databases for large-scale research assessments

National exercises for evaluating the research activity of universities are becoming regular practice in an increasing number of countries. These exercises have mainly been conducted through peer review. Bibliometrics has not been able to offer a valid large-scale alternative because of the almost overwhelming difficulty of identifying the true author of each publication. We address this problem by presenting a heuristic approach to author name disambiguation in bibliometric datasets for large-scale research assessments. The proposed application concerns the Italian university system, comprising 80 universities and a research staff of over 60,000 scientists. The key advantage of the approach is its ease of implementation. The algorithms are practical to apply, and they scale and extend considerably better than state-of-the-art unsupervised approaches. Moreover, their performance in terms of precision and recall, which can be further improved, appears adequate for the typical needs of large-scale bibliometric research assessments. © 2011 Wiley Periodicals, Inc.
