Paper information - Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop.

AMiner is a free online academic search and mining system that has collected more than 130,000,000 researcher profiles and over 200,000,000 papers from multiple publication databases [25]. In this paper, we present the implementation and deployment of name disambiguation, a core component in AMiner. The problem has been studied for decades but remains largely unsolved. In AMiner, we conducted a systematic investigation of the problem and propose a comprehensive framework to address it. We propose a novel representation learning method that incorporates both global and local information, and present an end-to-end cluster size estimation method that is significantly better than the traditional BIC-based method. To improve accuracy, we involve human annotators in the disambiguation process. We carefully evaluate the proposed framework on large real-world data, and experimental results show that the proposed solution achieves clearly better performance (+7-35% in terms of F1-score) than several state-of-the-art methods, including GHOST [5], Zhang et al. [33], and Louppe et al. [17]. Finally, the algorithm has been deployed in AMiner to handle the disambiguation problem at the billion scale, which further demonstrates both the effectiveness and the efficiency of the proposed framework.
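The abstract reports gains in F1-score but does not say how the score is computed. As a rough illustration only, the sketch below assumes a pairwise F1, a common choice for evaluating name disambiguation clusterings: two papers form a positive pair when they are assigned to the same author. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of index pairs (i, j), i < j, that share a label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_f1(true_labels, pred_labels):
    """Pairwise precision, recall, and F1 between two clusterings of the same papers."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    true_pairs = same_cluster_pairs(true_labels)
    pred_pairs = same_cluster_pairs(pred_labels)
    correct = len(true_pairs & pred_pairs)
    # Guard against empty pair sets (e.g. every paper in its own cluster).
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: five papers published under one ambiguous name.
true_labels = ["a", "a", "a", "b", "b"]  # assumed ground-truth author per paper
pred_labels = [0, 0, 1, 1, 1]            # assumed cluster id per paper
precision, recall, f1 = pairwise_f1(true_labels, pred_labels)
print(precision, recall, f1)
```

Only the grouping matters here, so the true and predicted labels may use different identifiers. Enumerating all pairs is quadratic in the number of papers per name, which is acceptable for a sketch but not for billion-scale data.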

[1] Christopher Joseph Pal, et al. Improving Author Coreference by Resource-Bounded Information Gathering from the Web, 2007, IJCAI.

[2] Lise Getoor, et al. Collective entity resolution in relational data, 2007, TKDD.

[3] Jie Tang, et al. ArnetMiner: extraction and mining of academic social networks, 2008, KDD.

[4] Dan Roth, et al. Identification and Tracing of Ambiguous Names: Discriminative and Generative Approaches, 2004, AAAI.

[5] Mohammad Al Hasan, et al. Name Disambiguation in Anonymized Graphs using Network Embedding, 2017, CIKM.

[6] C. Lee Giles, et al. Efficient Name Disambiguation for Large-Scale Databases, 2006, PKDD.

[7] Andrew McCallum, et al. Probabilistic Reasoning about Human Edits in Information Integration, 2013.

[8] Devdatt P. Dubhashi, et al. Entity disambiguation in anonymized graphs using graph kernels, 2013, CIKM.

[9] Philip S. Yu, et al. Object Distinction: Distinguishing Objects with Identical Names, 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[10] Jennifer Widom, et al. Swoosh: a generic approach to entity resolution, 2008, The VLDB Journal.

[11] Enhong Chen, et al. Learning Deep Representations for Graph Clustering, 2014, AAAI.

[12] Jie Tang, et al. A Combination Approach to Web User Profiling, 2010, TKDD.

[13] Samy Bengio, et al. Neural Combinatorial Optimization with Reinforcement Learning, 2016, ICLR.

[14] James Philbin, et al. FaceNet: A unified embedding for face recognition and clustering, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Jianmin Wu, et al. Integrated network analysis platform for protein-protein interactions, 2009, Nature Methods.

[16] Harold W. Kuhn, et al. The Hungarian method for the assignment problem, 1955, 50 Years of Integer Programming.

[17] C. Giraud. Introduction to High-Dimensional Statistics, 2014.

[18] Hiroshi Nakagawa, et al. Person name disambiguation by bootstrapping, 2010, SIGIR.

[19] Andrew McCallum, et al. Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models, 2011, ACL.

[20] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[21] David A. Wagner, et al. A Generalized Birthday Problem, 2002, CRYPTO.

[22] Fabian M. Suchanek, et al. Canonicalizing Open Knowledge Bases, 2014, CIKM.

[23] Andrew McCallum, et al. A Discriminative Hierarchical Model for Fast Coreference at Large Scale, 2012, ACL.

[24] Andrew W. Moore, et al. X-means: Extending K-means with Efficient Estimation of the Number of Clusters, 2000, ICML.

[25] Gilles Louppe, et al. Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning, 2015, KESW.

[26] Stephen E. Fienberg, et al. A Comparison of Blocking Methods for Record Linkage, 2014, Privacy in Statistical Databases.

[27] Jianyong Wang, et al. GRAPE: A Graph-Based Framework for Disambiguating People Appearances in Web Search, 2009, 2009 Ninth IEEE International Conference on Data Mining.

[28] Philip S. Yu, et al. COSNET: Connecting Heterogeneous Social Networks with Local and Global Consistency, 2015, KDD.

[29] Max Welling, et al. Variational Graph Auto-Encoders, 2016, ArXiv.

[30] C. Lee Giles, et al. Two supervised learning approaches for name disambiguation in author citations, 2004, Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries, 2004.

[31] Andrew McCallum, et al. Disambiguating Web appearances of people in a social network, 2005, WWW '05.

[32] Jianyong Wang, et al. On Graph-Based Name Disambiguation, 2011, JDIQ.

[33] Max Welling, et al. Semi-Supervised Classification with Graph Convolutional Networks, 2016, ICLR.