Author Name Disambiguation for PubMed

Log analysis shows that PubMed users frequently use author names in queries for retrieving scientific literature. However, author name ambiguity may lead to irrelevant retrieval results. To improve the PubMed user experience with author name queries, we designed an author name disambiguation system consisting of similarity estimation and agglomerative clustering. A machine‐learning method was employed to score the features for disambiguating a pair of papers with ambiguous names. These features enable the computation of pairwise similarity scores to estimate the probability of a pair of papers belonging to the same author, which drives an agglomerative clustering algorithm regulated by 2 factors: name compatibility and probability level. With transitivity violation correction, high‐precision author clustering is achieved by focusing on minimizing false‐positive pairing. Disambiguation performance is evaluated with manual verification of random samples of pairs from clustering results. When compared with a state‐of‐the‐art system, our evaluation shows that among all the pairs, the lumping error rate drops from 10.1% to 2.2% for our system, while the splitting error rate rises from 1.8% to 7.7%. This results in an overall error rate of 9.9%, compared with 11.9% for the state‐of‐the‐art method. Other evaluations based on gold‐standard data also show an increase in the accuracy of our clustering. We attribute the performance improvement to the machine‐learning method driven by a large‐scale training set and the clustering algorithm regulated by a name compatibility scheme preferring precision. With integration of the author name disambiguation system into the PubMed search engine, the overall click‐through rate of PubMed users on author name query results improved from 34.9% to 36.9%.
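
The two-stage design described above (pairwise probability estimates feeding an agglomerative clustering step that is gated by name compatibility and a probability level) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `names_compatible` rule, the average-linkage scoring, the 0.5 threshold, and the toy probabilities are all assumptions chosen for illustration, and transitivity violation correction is not modeled.

```python
from itertools import combinations


def names_compatible(a, b):
    """Hypothetical compatibility rule (an assumption, not the paper's scheme):
    same last name, and one first-name string is a prefix of the other,
    so the initial 'j' is compatible with both 'john' and 'jane'."""
    last_a, first_a = a
    last_b, first_b = b
    if last_a != last_b:
        return False
    return first_a.startswith(first_b) or first_b.startswith(first_a)


def cluster(papers, prob, threshold=0.5):
    """Greedy agglomerative clustering over pairwise same-author probabilities.

    papers: dict mapping paper id -> (last_name, first_name)
    prob:   dict mapping frozenset({id_i, id_j}) -> estimated probability
    threshold: assumed probability level below which no merge is made
    """
    clusters = [{pid} for pid in papers]
    while True:
        best, best_score = None, threshold
        for x, y in combinations(range(len(clusters)), 2):
            cx, cy = clusters[x], clusters[y]
            # Gate 1: every cross-cluster pair of names must be compatible.
            if not all(names_compatible(papers[i], papers[j])
                       for i in cx for j in cy):
                continue
            # Gate 2: average pairwise probability must exceed the threshold.
            score = sum(prob.get(frozenset((i, j)), 0.0)
                        for i in cx for j in cy) / (len(cx) * len(cy))
            if score > best_score:
                best, best_score = (x, y), score
        if best is None:
            return clusters
        x, y = best  # x < y, so deleting index y leaves index x valid
        clusters[x] |= clusters[y]
        del clusters[y]


# Toy data (made up for illustration only).
papers = {1: ("smith", "john"), 2: ("smith", "j"), 3: ("smith", "jane")}
prob = {
    frozenset((1, 2)): 0.9,
    frozenset((2, 3)): 0.8,
    frozenset((1, 3)): 0.7,
}
result = cluster(papers, prob)
print(sorted(sorted(c) for c in result))
```

In this toy run, papers 1 and 2 merge first because they have the highest probability. Paper 3 then stays separate even though its probabilities are above the threshold, because "jane" is incompatible with "john". Blocking merges on name incompatibility is one way a clustering step can favor precision (fewer lumping errors) at the cost of more splitting errors, which is the trade-off the reported error rates reflect.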
