Paper information - Ranking-based name matching for author disambiguation in bibliographic data

Author name ambiguity is a common problem in digital publication libraries such as Microsoft Academic Search. It arises mainly because different authors may publish under the same name, while the same author may publish under several names because of abbreviations, nicknames, etc. Author disambiguation is the goal of Track II of the KDD Cup 2013 data mining contest. In this paper we introduce RankMatch, our ranking-based name matching algorithm and system. One important feature of our solution is its use of heterogeneous meta-paths to evaluate the similarity between two potential duplicate authors whose names are compatible. We participated under the team name "SmallData", and our final solution achieved a Mean F1 score of 99.157%, ranking second in the contest.
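
The abstract's central idea (scoring pairs of authors with compatible names by a meta-path similarity, then ranking them) can be illustrated with a minimal Python sketch. This is not the RankMatch implementation: the compatibility rule, the `authors` data layout, and the function names `names_compatible`, `pathsim`, and `rank_candidates` are all assumptions made for illustration, and the PathSim measure over an author-paper-venue-paper-author meta-path is swapped in as one plausible meta-path similarity.

```python
from collections import Counter


def names_compatible(a: str, b: str) -> bool:
    """Hypothetical rule: same last name, and each aligned pair of
    given-name tokens is either identical or an initial of the other."""
    ta = a.lower().replace(".", "").split()
    tb = b.lower().replace(".", "").split()
    if not ta or not tb or ta[-1] != tb[-1]:
        return False
    # zip stops at the shorter list, so a missing middle name is tolerated.
    return all(
        x == y
        or (len(x) == 1 and y.startswith(x))
        or (len(y) == 1 and x.startswith(y))
        for x, y in zip(ta[:-1], tb[:-1])
    )


def pathsim(venues_x, venues_y) -> float:
    """PathSim over the author-paper-venue-paper-author meta-path.
    Each argument lists the venue of every paper by one author."""
    wx, wy = Counter(venues_x), Counter(venues_y)
    cross = sum(wx[v] * wy[v] for v in wx)      # path instances x -> y
    self_x = sum(c * c for c in wx.values())    # path instances x -> x
    self_y = sum(c * c for c in wy.values())    # path instances y -> y
    denom = self_x + self_y
    return 2 * cross / denom if denom else 0.0


def rank_candidates(target, authors):
    """Rank authors whose names are compatible with the target's name
    by descending meta-path similarity. `authors` maps an author id to
    a (name, venues) pair; this layout is an assumption."""
    t_name, t_venues = authors[target]
    scored = [
        (pathsim(t_venues, venues), aid)
        for aid, (name, venues) in authors.items()
        if aid != target and names_compatible(t_name, name)
    ]
    return [aid for _, aid in sorted(scored, reverse=True)]


# Toy data, invented for this example only.
authors = {
    1: ("Jiawei Han", ["KDD", "KDD", "VLDB"]),
    2: ("J. Han", ["KDD", "VLDB", "VLDB"]),
    3: ("J. Han", ["STOC"]),
    4: ("Jialu Liu", ["KDD", "KDD", "VLDB"]),
}
print(rank_candidates(1, authors))
```

In this toy data, author 4 shares every venue with author 1 but is never scored, because the name filter runs first; similarity is only used to order candidates whose names could plausibly refer to the same person.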

Jiawei Han | Jialu Liu | Chi Wang | Kin Hou Lei | Jeffery Yufei Liu
