CONNA: Addressing Name Disambiguation on The Fly

Name disambiguation is a key and very tough problem in many online systems such as social search and academic search. Despite considerable research, one critical issue has not been systematically studied: disambiguation on the fly, that is, completing the disambiguation in real time. This is very challenging, as the disambiguation algorithm must be accurate, efficient, and error-tolerant. In this paper, we propose a novel framework, CONNA, which trains a matching component and a decision component jointly via reinforcement learning. The matching component is responsible for finding the top matched candidate for a given paper, and the decision component is responsible for deciding whether to assign the paper to the top matched person or to create a new person. The two components are intertwined and can be bootstrapped via joint training. Empirically, we evaluate CONNA on AMiner, a large online academic search system. Experimental results show that joint training of the matching and decision components achieves a 5.37%-19.84% improvement in F1 score. CONNA has been successfully deployed on AMiner.
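
The division of labor between the two components can be illustrated with a minimal sketch. This is not the CONNA model itself: the abstract describes learned components trained jointly via reinforcement learning, whereas the sketch below assumes a plain cosine-similarity matcher and a fixed-threshold decision rule as stand-ins. All names (`match_top_candidate`, `decide`, `assign_on_the_fly`, the `"NEW"` label, and the 0.8 threshold) are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_top_candidate(
    paper: Vector, candidates: Dict[str, Vector]
) -> Optional[Tuple[str, float]]:
    """Matching step (placeholder): return the best-scoring candidate person.

    Assumption: papers and persons are already embedded as vectors.
    """
    if not candidates:
        return None
    scored = [(pid, cosine(paper, vec)) for pid, vec in candidates.items()]
    return max(scored, key=lambda item: item[1])


def decide(top: Optional[Tuple[str, float]], threshold: float = 0.8) -> str:
    """Decision step (placeholder): assign the top match or create a new person.

    Assumption: a fixed threshold replaces the learned decision component.
    """
    if top is None or top[1] < threshold:
        return "NEW"
    return top[0]


def assign_on_the_fly(
    paper: Vector, candidates: Dict[str, Vector], threshold: float = 0.8
) -> str:
    """Run matching, then decision, for a single incoming paper."""
    return decide(match_top_candidate(paper, candidates), threshold)


if __name__ == "__main__":
    persons = {"person_a": [1.0, 0.0], "person_b": [0.0, 1.0]}
    print(assign_on_the_fly([0.9, 0.1], persons))  # close to person_a
    print(assign_on_the_fly([1.0, 1.0], persons))  # ambiguous between both
```

The sketch keeps the two steps as separate functions because the decision depends only on the matcher's top result, which mirrors the pipeline described above; in the proposed framework the two are trained jointly rather than fixed by hand.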
