Improving the performance of personal name disambiguation using web directories

Frequent requests from users to search engines on the World Wide Web are to search for information about people using personal names. Current search engines only return sets of documents containing the name queried, but, as several people usually share a personal name, the resulting sets often contain documents relevant to several people. It is necessary to disambiguate people in these result sets in order to to help users find the person of interest more readily. In the task of name disambiguation, effective measurement of similarities in the documents is a crucial step towards the final disambiguation. We propose a new method that uses web directories as a knowledge base to find common contexts in documents and uses the common contexts measure to determine document similarities. Experiments, conducted on documents mentioning real people on the web, together with several famous web directory structures, suggest that there are significant advantages in using web directories to disambiguate people compared with other conventional methods.

[1]  Breck Baldwin,et al.  Entity-Based Cross-Document Coreferencing Using the Vector Space Model , 1998, COLING.

[2]  Xiaojun Wan,et al.  Person resolution in person search results: WebHawk , 2005, CIKM '05.

[3]  Naftali Tishby,et al.  Document clustering using word clusters via the information bottleneck method , 2000, SIGIR '00.

[4]  Hinrich Schütze,et al.  Automatic Word Sense Discrimination , 1998, Comput. Linguistics.

[5]  Ted Pedersen,et al.  Name Discrimination by Clustering Similar Contexts , 2005, CICLing.

[6]  Eduard H. Hovy,et al.  Learning surface text patterns for a Question Answering System , 2002, ACL.

[7]  Danushka Bollegala,et al.  Extracting Key Phrases to Disambiguate Personal Name Queries in Web Search , 2006 .

[8]  Andrew McCallum,et al.  Disambiguating Web appearances of people in a social network , 2005, WWW '05.

[9]  Samuel Kaski,et al.  Self organization of a massive document collection , 2000, IEEE Trans. Neural Networks Learn. Syst..

[10]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[11]  Atsuhiro Takasu,et al.  Using Web Directories for Similarity Measurement in Personal Name Disambiguation , 2007, 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07).

[12]  C. Lee Giles,et al.  Two supervised learning approaches for name disambiguation in author citations , 2004, Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries, 2004..

[13]  Atsuhiro Takasu,et al.  Using a knowledge base to disambiguate personal name in web search results , 2007, SAC '07.

[14]  David Yarowsky,et al.  Unsupervised Personal Name Disambiguation , 2003, CoNLL.

[15]  Breck Baldwin,et al.  University of Pennsylvania: description of the University of Pennsylvania system used for MUC-6 , 1995, MUC.

[16]  Akiko Aizawa The feature quantity: an information theoretic perspective of Tfidf-like measures , 2000, SIGIR '00.

[17]  Eero Hyvönen,et al.  Thirteenth International World Wide Web Conference Proceedings, WWW2004 , 2004 .

[18]  Sophia Ananiadou,et al.  The C-value/NC-value Method of Automatic Recognition for Multi-Word Terms , 1998, ECDL.