Disambiguating authors in academic publications using random forests

Users of digital libraries usually want to know the exact author or authors of an article. But different authors may share the same names, either as full names or as initials and last names (complete name change examples are not considered here). In such a case, the user would like the digital library to differentiate among these authors. Name disambiguation can help in many cases; one being a user in a search of all articles written by a particular author. Disambiguation also enables better bibliometric analysis by allowing a more accurate counting and grouping of publications and citations. In this paper, we describe an algorithm for pair-wise disambiguation of author names based on a machine learning classification algorithm, random forests. We define a set of similarity profile features to assist in author disambiguation. Our experiments on the Medline database show that the random forest model outperforms other previously proposed techniques such as those using support-vector machines (SVM). In addition, we demonstrate that the variable importance produced by the random forest model can be used in feature selection with little degradation in the disambiguation accuracy. In particular, the inverse document frequency of author last name and the middle name's similarity alone achieves an accuracy of almost 90%.

[1]  Dongwon Lee,et al.  Parallel linkage , 2007, CIKM '07.

[2]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[3]  Charles Elkan,et al.  An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records , 1997, DMKD.

[4]  Henry A. Kautz,et al.  Hardening soft information sources , 2000, KDD '00.

[5]  William E. Winkler,et al.  The State of Record Linkage and Current Research Problems , 1999 .

[6]  Byung-Won On,et al.  Comparative study of name disambiguation problem using a scalable blocking-based framework , 2005, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05).

[7]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[8]  C. Lee Giles,et al.  Efficient Name Disambiguation for Large-Scale Databases , 2006, PKDD.

[9]  Craig A. Knoblock,et al.  Learning object identification rules for information integration , 2001, Inf. Syst..

[10]  Ivan P. Fellegi,et al.  A Theory for Record Linkage , 1969 .

[11]  Neil R. Smalheiser,et al.  A probabilistic similarity metric for Medline records: A model for author name disambiguation: Research Articles , 2005 .

[12]  Wray L. Buntine,et al.  A Further Comparison of Splitting Rules for Decision-Tree Induction , 1992, Machine Learning.

[13]  Raymond J. Mooney,et al.  Adaptive Blocking: Learning to Scale Up Record Linkage , 2006, Sixth International Conference on Data Mining (ICDM'06).

[14]  Neil R. Smalheiser,et al.  A probabilistic similarity metric for Medline records: A model for author name disambiguation , 2005, J. Assoc. Inf. Sci. Technol..

[15]  Hui Han,et al.  Name disambiguation in author citations using a K-way spectral clustering method , 2005, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05).

[16]  Hwee Tou Ng,et al.  A Machine Learning Approach to Coreference Resolution of Noun Phrases , 2001, CL.

[17]  Andy Liaw,et al.  Classification and Regression by randomForest , 2007 .

[18]  Andrew McCallum,et al.  Disambiguating Web appearances of people in a social network , 2005, WWW '05.

[19]  Hector Garcia-Molina,et al.  D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution , 2007, 27th International Conference on Distributed Computing Systems (ICDCS '07).

[20]  W. Winkler SERIES ( Statistics # 2005-02 ) Approximate String Comparator Search Strategies for Very Large Administrative Lists , 2005 .

[21]  Yang Song,et al.  Efficient topic-based unsupervised name disambiguation , 2007, JCDL '07.

[22]  Chen Li,et al.  Efficient record linkage in large data sets , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings..

[23]  Pradeep Ravikumar,et al.  Adaptive Name Matching in Information Integration , 2003, IEEE Intell. Syst..

[24]  Grace Hui Yang,et al.  Near-duplicate detection for eRulemaking , 2005, DG.O.

[25]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[26]  Salvatore J. Stolfo,et al.  The merge/purge problem for large databases , 1995, SIGMOD '95.

[27]  Peter Christen,et al.  A Comparison of Personal Name Matching: Techniques and Practical Issues , 2006, Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06).

[28]  Cheng Li,et al.  Two supervised learning approaches for name disambiguation in author citations , 2004, Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries, 2004..