Effective string processing and matching for author disambiguation

Track 2 of KDD Cup 2013 aims to identify duplicate authors in a data set from Microsoft Academic Search. This type of problem appears in many large-scale applications that compile information from different sources. This paper describes the solution we developed at National Taiwan University, which won first prize in the competition. We propose an effective name matching framework and realize two implementations of it. An important strategy in our approach is to treat Chinese and non-Chinese names separately because their naming conventions differ. Post-processing, including merging the results of the two predictions, further boosts performance. Our approach achieves an F1-score of 0.99202 on the private leaderboard and 0.99195 on the public leaderboard.
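To make the idea of name matching with separate handling for two name groups concrete, the following is a minimal sketch, not the framework described in this paper. Everything in it is an assumption made for illustration: the function names (`normalize_name`, `name_key`, `group_duplicates`), the choice to ignore token order for one group of names, and the `order_insensitive_ids` parameter, which stands in for whatever process flags a name as belonging to that group.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_name(name):
    """Lowercase, strip accents, and replace punctuation with spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z\s]", " ", no_accents.lower())
    return " ".join(cleaned.split())


def name_key(name, order_insensitive=False):
    """Build a matching key; optionally ignore token order (an assumption)."""
    tokens = normalize_name(name).split()
    if order_insensitive:
        tokens = sorted(tokens)
    return " ".join(tokens)


def group_duplicates(authors, order_insensitive_ids=frozenset()):
    """Group author ids whose keys collide, keeping the two pools separate.

    authors: dict mapping a hypothetical author id to a name string.
    order_insensitive_ids: ids whose names are matched regardless of token order.
    """
    groups = defaultdict(list)
    for author_id, name in authors.items():
        flagged = author_id in order_insensitive_ids
        # The flag is part of the key, so the two pools never mix.
        groups[(flagged, name_key(name, flagged))].append(author_id)
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


# Illustrative data only; these ids and flags are invented for the example.
authors = {
    1: "José García",
    2: "Jose Garcia",
    3: "Wei-Sheng Chin",
    4: "Chin Wei Sheng",
    5: "Alice Smith",
}
duplicates = group_duplicates(authors, order_insensitive_ids={3, 4})
```

Keeping the flag inside the grouping key means names from the two pools are never merged with each other, which mirrors the idea of considering the two kinds of names separately. A real system would need far more than exact key collisions, for example handling of initials, nicknames, and publication evidence.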
