Evaluating author name disambiguation for digital libraries: a case of DBLP

Author name ambiguity in a digital library may distort the findings of research that mines the library’s authorship data. This study evaluates author name disambiguation in DBLP, a digital library that is widely used but whose disambiguation performance has been insufficiently evaluated. To do so, it takes a triangulation approach: author name disambiguation for a digital library is better evaluated when its performance is assessed on multiple labeled datasets and compared against baselines. Tested on three types of labeled data containing 5,000 to 6 million disambiguated names, DBLP is shown to assign author names to distinct authors quite accurately, with pairwise precision, recall, and F1 around 0.90 or above overall. DBLP’s author name disambiguation performs well even on large ambiguous name blocks but poorly on distinguishing authors who share the same name. Compared with other disambiguation algorithms, DBLP’s performance is quite competitive, possibly owing to its hybrid approach, which combines algorithmic disambiguation with manual error correction. The study closes with a discussion of the strengths and weaknesses of the labeled datasets used here, to inform future efforts to evaluate author name disambiguation at digital library scale.
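
The pairwise precision, recall, and F1 measures mentioned above can be illustrated with a minimal sketch. It assumes the common pairwise definition (precision is the share of predicted same-author record pairs that are also same-author in the labeled data, and recall is the share of labeled same-author pairs that the prediction recovers); the study's exact evaluation procedure is not given in this abstract. The function names and the toy record and author identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def same_author_pairs(labels):
    """Return the set of record pairs assigned to the same author.

    `labels` maps a record id to an author (cluster) id.
    """
    clusters = defaultdict(list)
    for record_id, author_id in labels.items():
        clusters[author_id].append(record_id)
    pairs = set()
    for members in clusters.values():
        # Sorting makes each pair a canonical (smaller, larger) tuple.
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pairwise_scores(predicted, truth):
    """Compute pairwise precision, recall, and F1 (assumed definition)."""
    pred_pairs = same_author_pairs(predicted)
    true_pairs = same_author_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    # Convention chosen here: an empty pair set counts as a perfect score.
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: five records, two true authors.
truth = {"r1": "A", "r2": "A", "r3": "A", "r4": "B", "r5": "B"}
# The prediction wrongly moves r3 into the second author's cluster.
predicted = {"r1": "x", "r2": "x", "r3": "y", "r4": "y", "r5": "y"}

precision, recall, f1 = pairwise_scores(predicted, truth)
print(precision, recall, f1)
```

In this toy case, two of the four predicted pairs are correct and two of the four labeled pairs are recovered. Enumerating pairs grows quadratically with cluster size, so for large name blocks a count-based formulation would likely be preferable to materializing every pair.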
