A constraint-based topic modeling approach for name disambiguation

Name ambiguity is the problem that different people may be referred to by the same name. This problem has become critical in many applications, particularly in online bibliography systems such as DBLP and CiteSeer. Although much work has addressed this problem, many challenges remain. This paper proposes a general framework of constraint-based topic modeling that uses user-defined constraints to improve name disambiguation performance. A Gibbs sampling algorithm that integrates the constraints is proposed for inference in the topic model. Experimental results on a real-world dataset show that the proposed approach yields significant improvements.
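To make the idea of constraint-aware inference concrete, the following is a minimal sketch and not the algorithm of the paper, whose details the abstract does not give. It assumes an LDA-style model with collapsed Gibbs sampling, in which each paper is a bag of word ids and a user-defined "must-link" constraint between two papers heuristically boosts the topics already used by the linked paper. The function name `constrained_gibbs`, the constraint form, and the weight `lam` are all hypothetical choices made for illustration.

```python
import random


def constrained_gibbs(docs, must_link, n_topics, vocab_size,
                      n_iters=50, alpha=0.1, beta=0.01, lam=1.0, seed=0):
    """Collapsed Gibbs sampling for an LDA-style model with a heuristic
    must-link bias (an illustrative assumption, not the paper's method).

    docs      : list of documents, each a list of word ids in [0, vocab_size)
    must_link : list of (doc_a, doc_b) index pairs assumed to share an author
    Returns the per-document topic count matrix.
    """
    rng = random.Random(seed)
    n_docs = len(docs)

    # Adjacency list built from the must-link constraints.
    neighbors = [[] for _ in range(n_docs)]
    for a, b in must_link:
        neighbors[a].append(b)
        neighbors[b].append(a)

    doc_topic = [[0] * n_topics for _ in range(n_docs)]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                weights = []
                for k in range(n_topics):
                    # Assumed constraint term: topic counts of linked documents.
                    linked = sum(doc_topic[n][k] for n in neighbors[d])
                    doc_part = doc_topic[d][k] + alpha + lam * linked
                    word_part = ((topic_word[k][w] + beta)
                                 / (topic_total[k] + vocab_size * beta))
                    weights.append(doc_part * word_part)

                # Sample a new topic and add the token back.
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic


# Toy usage: four tiny "papers" with two must-link pairs.
docs = [[0, 1, 0, 1], [1, 0, 1], [2, 3, 3, 2], [3, 2, 2]]
result = constrained_gibbs(docs, must_link=[(0, 1), (2, 3)],
                           n_topics=2, vocab_size=4)
```

With `lam=0` the sketch reduces to ordinary collapsed Gibbs sampling for LDA. In a disambiguation setting, one might then group the papers that carry an ambiguous name by their dominant topic, although how the paper itself maps topics to persons is not stated in the abstract.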
