Resolving Person Names in Web People Search

Disambiguating person names in a set of documents (such as the web pages returned in response to a person name query) is a key task for presenting search results and for automatically profiling experts. Because the documents are largely unstructured and the number of people sharing a name is unknown, the problem is difficult. This chapter treats person name disambiguation as a document clustering problem, in which each document is assumed to represent a particular person. This leads to the person cluster hypothesis, which states that similar documents tend to represent the same person. Single Pass Clustering, k-Means Clustering, Agglomerative Clustering and Probabilistic Latent Semantic Analysis are employed and empirically evaluated in this context. Results on the SemEval 2007 Web People Search task show that the person cluster hypothesis holds reasonably well and that Single Pass Clustering and Agglomerative Clustering provide the best performance.
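To make the clustering view concrete, the following is a minimal sketch of single pass clustering over documents returned for one name. It is an illustration only, not the chapter's implementation: the bag-of-words representation, the cosine similarity measure, the summed-count centroid, the 0.5 threshold, and the names `vectorize`, `cosine`, and `single_pass_cluster` are all assumptions made for this example.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts over lowercased whitespace tokens (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(docs, threshold=0.5):
    """Assign each document, in order, to its most similar cluster,
    or open a new cluster if no similarity reaches the threshold.

    The threshold is an arbitrary illustrative value, not one from the chapter.
    Returns a list of clusters, each a list of document indices.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [indices]}
    for i, text in enumerate(docs):
        vec = vectorize(text)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            best["centroid"].update(vec)  # centroid kept as summed term counts
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return [c["members"] for c in clusters]


docs = [
    "john smith professor computer science university",
    "john smith computer science lecture university",
    "john smith football striker goal league",
]
clusters = single_pass_cluster(docs, threshold=0.5)
```

Under the person cluster hypothesis, each resulting cluster is read as one person. Because every document contains the queried name, the name tokens alone contribute some similarity between any two pages, which is one reason the threshold matters in a sketch like this.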
