Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop.

AMiner is a free online academic search and mining system that has collected more than 130,000,000 researcher profiles and over 200,000,000 papers from multiple publication databases [25]. In this paper, we present the implementation and deployment of name disambiguation, a core component of AMiner. The problem has been studied for decades but remains largely unsolved. In AMiner, we conducted a systematic investigation of the problem and propose a comprehensive framework to address it. We propose a novel representation learning method that incorporates both global and local information, and we present an end-to-end cluster size estimation method that is significantly better than the traditional BIC-based method. To improve accuracy, we involve human annotators in the disambiguation process. We carefully evaluate the proposed framework on large real-world data, and experimental results show that the proposed solution achieves clearly better performance (+7-35% in terms of F1-score) than several state-of-the-art methods, including GHOST [5], Zhang et al. [33], and Louppe et al. [17]. Finally, the algorithm has been deployed in AMiner to handle the disambiguation problem at the billion scale, which further demonstrates both the effectiveness and the efficiency of the proposed framework.
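
The abstract reports gains in F1-score but does not say which variant is used. As an illustration only, the sketch below computes pairwise F1, a metric commonly used to score author clusterings: every pair of papers placed in the same cluster counts as a predicted "same author" link and is checked against the ground-truth links. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def same_author_pairs(labels):
    """Return the set of paper-index pairs that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_f1(true_labels, pred_labels):
    """Pairwise F1 between a ground-truth and a predicted clustering.

    Both arguments assign one cluster label per paper, in the same order.
    Label values themselves do not matter, only which papers share a label.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")

    true_pairs = same_author_pairs(true_labels)
    pred_pairs = same_author_pairs(pred_labels)
    correct = len(true_pairs & pred_pairs)

    # Define precision/recall as 0.0 when there are no pairs to score.
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: five papers under one ambiguous name, two real authors.
truth = ["a", "a", "a", "b", "b"]
predicted = [0, 0, 1, 1, 1]
score = pairwise_f1(truth, predicted)
```

In the toy example, the ground truth and the prediction each contain four same-author pairs, and two of them coincide, so precision and recall are both 0.5. Enumerating all pairs is quadratic in the number of papers, which is acceptable per ambiguous name but would need a count-based formulation for very large clusters.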

[1] Christopher Joseph Pal, et al. Improving Author Coreference by Resource-Bounded Information Gathering from the Web, 2007, IJCAI.

[2] Lise Getoor, et al. Collective entity resolution in relational data, 2007, TKDD.

[3] Jie Tang, et al. ArnetMiner: extraction and mining of academic social networks, 2008, KDD.

[4] Dan Roth, et al. Identification and Tracing of Ambiguous Names: Discriminative and Generative Approaches, 2004, AAAI.

[5] Mohammad Al Hasan, et al. Name Disambiguation in Anonymized Graphs using Network Embedding, 2017, CIKM.

[6] C. Lee Giles, et al. Efficient Name Disambiguation for Large-Scale Databases, 2006, PKDD.

[7] Andrew McCallum, et al. Probabilistic Reasoning about Human Edits in Information Integration, 2013.

[8] Devdatt P. Dubhashi, et al. Entity disambiguation in anonymized graphs using graph kernels, 2013, CIKM.

[9] Philip S. Yu, et al. Object Distinction: Distinguishing Objects with Identical Names, 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[10] Jennifer Widom, et al. Swoosh: a generic approach to entity resolution, 2008, The VLDB Journal.

[11] Enhong Chen, et al. Learning Deep Representations for Graph Clustering, 2014, AAAI.

[12] Jie Tang, et al. A Combination Approach to Web User Profiling, 2010, TKDD.

[13] Samy Bengio, et al. Neural Combinatorial Optimization with Reinforcement Learning, 2016, ICLR.

[14] James Philbin, et al. FaceNet: A unified embedding for face recognition and clustering, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Jianmin Wu, et al. Integrated network analysis platform for protein-protein interactions, 2009, Nature Methods.

[16] Harold W. Kuhn, et al. The Hungarian method for the assignment problem, 1955, 50 Years of Integer Programming.

[17] C. Giraud. Introduction to High-Dimensional Statistics, 2014.

[18] Hiroshi Nakagawa, et al. Person name disambiguation by bootstrapping, 2010, SIGIR.

[19] Andrew McCallum, et al. Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models, 2011, ACL.

[20] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[21] David A. Wagner, et al. A Generalized Birthday Problem, 2002, CRYPTO.

[22] Fabian M. Suchanek, et al. Canonicalizing Open Knowledge Bases, 2014, CIKM.

[23] Andrew McCallum, et al. A Discriminative Hierarchical Model for Fast Coreference at Large Scale, 2012, ACL.

[24] Andrew W. Moore, et al. X-means: Extending K-means with Efficient Estimation of the Number of Clusters, 2000, ICML.

[25] Gilles Louppe, et al. Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning, 2015, KESW.

[26] Stephen E. Fienberg, et al. A Comparison of Blocking Methods for Record Linkage, 2014, Privacy in Statistical Databases.

[27] Jianyong Wang, et al. GRAPE: A Graph-Based Framework for Disambiguating People Appearances in Web Search, 2009, 2009 Ninth IEEE International Conference on Data Mining.

[28] Philip S. Yu, et al. COSNET: Connecting Heterogeneous Social Networks with Local and Global Consistency, 2015, KDD.

[29] Max Welling, et al. Variational Graph Auto-Encoders, 2016, ArXiv.

[30] C. Lee Giles, et al. Two supervised learning approaches for name disambiguation in author citations, 2004, Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries, 2004.

[31] Andrew McCallum, et al. Disambiguating Web appearances of people in a social network, 2005, WWW '05.

[32] Jianyong Wang, et al. On Graph-Based Name Disambiguation, 2011, JDIQ.

[33] Max Welling, et al. Semi-Supervised Classification with Graph Convolutional Networks, 2016, ICLR.