Author Name Disambiguation in Bibliographic Databases: A Survey

Entity resolution has been a challenging and active research area in Information Systems over the last decade. Author Name Disambiguation (AND) in Bibliographic Databases (BD) such as DBLP, CiteSeer, and Scopus is a specialized form of entity resolution. Given a set of citations and their underlying authors, the AND task is to determine which citations belong to the same author. In this survey, we start with three basic AND problems, followed by the need for solutions and the associated challenges. We then provide a generic five-step framework for handling AND issues: (1) preparation of the dataset, (2) selection of publication attributes, (3) selection of similarity metrics, (4) selection of models, and (5) evaluation of clustering performance. We also categorize and elaborate on the similarity metrics and methods. Finally, we give future directions and recommendations for this dynamic area of research.
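The five steps can be illustrated with a minimal sketch. It is not the survey's method: the toy records, the choice of co-authors and title words as attributes, the Jaccard metric, the equal weights, the 0.2 threshold, and the single-link clustering are all assumptions made for illustration, and every function name below is hypothetical.

```python
from itertools import combinations

# Step 1: dataset preparation. Hypothetical toy records for one ambiguous name;
# "label" holds the true author identity and is used only for evaluation.
CITATIONS = [
    {"coauthors": {"a. kumar", "b. lee"},
     "title": "topic models for name disambiguation", "label": "wang_1"},
    {"coauthors": {"b. lee", "c. diaz"},
     "title": "clustering citations with topic models", "label": "wang_1"},
    {"coauthors": {"d. novak"},
     "title": "protein folding simulation methods", "label": "wang_2"},
    {"coauthors": {"d. novak", "e. sato"},
     "title": "simulation of protein dynamics", "label": "wang_2"},
]


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(x, y):
    """Steps 2 and 3: attributes are co-authors and title words (assumed),
    the metric is an equally weighted mean of two Jaccard scores (assumed)."""
    coauthor_sim = jaccard(x["coauthors"], y["coauthors"])
    title_sim = jaccard(set(x["title"].split()), set(y["title"].split()))
    return 0.5 * coauthor_sim + 0.5 * title_sim


def cluster(citations, threshold=0.2):
    """Step 4: single-link clustering with union-find. Any pair whose
    similarity reaches the threshold is merged into one cluster."""
    parent = list(range(len(citations)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(citations)), 2):
        if similarity(citations[i], citations[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(citations))]


def pairwise_scores(pred, true):
    """Step 5: pairwise precision and recall over all citation pairs."""
    pairs = list(combinations(range(len(pred)), 2))
    both = sum(1 for i, j in pairs if pred[i] == pred[j] and true[i] == true[j])
    predicted = sum(1 for i, j in pairs if pred[i] == pred[j])
    actual = sum(1 for i, j in pairs if true[i] == true[j])
    precision = both / predicted if predicted else 1.0
    recall = both / actual if actual else 1.0
    return precision, recall


if __name__ == "__main__":
    labels = cluster(CITATIONS)
    truth = [c["label"] for c in CITATIONS]
    print(labels, pairwise_scores(labels, truth))
```

In practice each step would be swapped for the choices appropriate to the database at hand, for example a learned similarity function or a different clustering model.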
