Person Name Disambiguation in Web Pages Using Social Network, Compound Words and Latent Topics

The World Wide Web (WWW) holds a great deal of information about people, and in recent years Web search engines have become a common way to learn about them. However, many people share the same name, so the search results for a single person name typically include Web pages about several different people. We propose a novel framework for person name disambiguation that has the following three component processes: (1) extraction of social network information by finding co-occurrences of named entities, (2) measurement of document similarities based on occurrences of key compound words, and (3) inference of topic information from documents based on the Dirichlet process unigram mixture model. Experiments on an actual Web document dataset show that our framework yields promising results.
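As a rough illustration of how the first two components might feed a page-grouping step, the following is a minimal sketch and not the authors' implementation. It assumes named entities and key compound words have already been extracted for each page. The Jaccard overlap for entity co-occurrence, the cosine measure for compound words, the equal weights, the threshold, and the names `cluster_pages`, `w_entity`, `w_compound` are all hypothetical choices made for this example. The topic inference with the Dirichlet process unigram mixture model is omitted.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of compound words."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def jaccard(a: set, b: set) -> float:
    """Overlap of the named entities that co-occur with the target name."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pages(pages, w_entity=0.5, w_compound=0.5, threshold=0.3):
    """Group page indices whose combined similarity reaches the threshold.

    Weights and threshold are illustrative assumptions, not values from the paper.
    Pages are linked transitively (single-link) with a small union-find.
    """
    parent = list(range(len(pages)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            score = (
                w_entity * jaccard(pages[i]["entities"], pages[j]["entities"])
                + w_compound * cosine(pages[i]["compounds"], pages[j]["compounds"])
            )
            if score >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(pages)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy input: three pages returned for the same (hypothetical) person name.
pages = [
    {"entities": {"MIT", "Alice Smith"},
     "compounds": Counter({"machine learning": 2, "neural network": 1})},
    {"entities": {"MIT", "Bob Lee"},
     "compounds": Counter({"machine learning": 1})},
    {"entities": {"Chicago Bulls"},
     "compounds": Counter({"basketball season": 3})},
]
print(cluster_pages(pages))
```

In this toy input the first two pages share an entity and a compound word, so they end up in one group, while the third page shares nothing with the others and stays alone.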
