A Web Service for Author Name Disambiguation in Scholarly Databases

Author Name Disambiguation (AND) is the task of clustering unique author names from publication records in scholarly or related databases. Although AND has been extensively studied and has served as an important preprocessing step for several tasks (e.g., calculating bibliometrics and scientometrics for authors), there are few publicly available tools for disambiguation in large-scale scholarly databases. Furthermore, most of the disambiguated data is embedded within the search engines of the scholarly databases, and existing application programming interfaces (APIs) have limited features and are often unavailable to users. This makes it difficult for researchers and developers to use the data for applications (e.g., author search) or research. Here, we design a novel, web-based, RESTful API for searching disambiguated authors, using the PubMed database as a sample application. We offer two types of queries, attribute-based queries and record-based queries, which serve different purposes. Attribute-based queries retrieve authors by the attributes available in the database. We study different search engines to find the most appropriate one for processing attribute-based queries. Record-based queries retrieve the authors that are most likely to have written a query publication provided by a user. To accelerate record-based queries, we develop a novel algorithm that performs fast record-to-cluster matching. We show that our algorithm accelerates the query by a factor of 4.01 compared to a naive baseline approach.
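To make the idea of a record-based query concrete, the following is a minimal sketch of one way a query record could be matched against author clusters. It is not the algorithm developed in this work, whose details are not given here. It assumes a name-based blocking key and a precomputed feature summary per cluster, so that the query is compared once per candidate cluster rather than once per record. All names (`name_key`, `build_index`, `match`), the record fields, and the Jaccard scoring are hypothetical choices made for illustration.

```python
from collections import defaultdict


def name_key(last, first):
    """Hypothetical blocking key: lowercased last name plus first initial."""
    return f"{last.lower()}_{first[:1].lower()}"


def record_features(record):
    """Assumed features: the record's coauthor names and keywords."""
    return set(record["coauthors"]) | set(record["keywords"])


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_index(clusters):
    """Precompute one feature summary per cluster, grouped by blocking key.

    `clusters` maps a cluster id to the list of records assigned to it.
    """
    index = defaultdict(dict)
    for cluster_id, records in clusters.items():
        key = name_key(records[0]["last"], records[0]["first"])
        summary = set().union(*(record_features(r) for r in records))
        index[key][cluster_id] = summary
    return index


def match(query, index):
    """Return (cluster_id, score) for the best candidate, or (None, 0.0)."""
    q = record_features(query)
    candidates = index.get(name_key(query["last"], query["first"]), {})
    scored = [(cid, jaccard(q, summary)) for cid, summary in candidates.items()]
    return max(scored, key=lambda pair: pair[1], default=(None, 0.0))


# Toy data, invented for this example only.
CLUSTERS = {
    "smith_j_1": [
        {"last": "Smith", "first": "John",
         "coauthors": ["lee k", "wang y"], "keywords": ["genomics"]},
        {"last": "Smith", "first": "John",
         "coauthors": ["lee k"], "keywords": ["sequencing"]},
    ],
    "smith_j_2": [
        {"last": "Smith", "first": "Jane",
         "coauthors": ["patel r"], "keywords": ["cardiology"]},
    ],
}

QUERY = {"last": "Smith", "first": "J.",
         "coauthors": ["lee k"], "keywords": ["genomics", "sequencing"]}

if __name__ == "__main__":
    print(match(QUERY, build_index(CLUSTERS)))
```

In this sketch the cost of a query grows with the number of candidate clusters sharing the blocking key, not with the number of records they contain. A naive baseline would instead score the query against every stored record.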
