Result Disambiguation in Web People Search

We study the problem of disambiguating the results of a web people search engine: given a query consisting of a person name plus the result pages for this query, find correct referents for all mentions by clustering the pages according to the different people sharing the name. While the problem has been studied extensively, we find that the increasing availability of results retrieved from social media platforms causes state-of-the-art methods to break down. We analyze the problem and propose a dual strategy that distinguishes between results obtained from social media platforms and those obtained from other sources. The two types of documents are disambiguated separately, each with its own method, and the resulting clusterings are then merged. We study several instantiations of the different stages in our proposed strategy and achieve state-of-the-art performance.
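The dual strategy can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the method instantiated in the paper: the domain list `SOCIAL_DOMAINS`, the Jaccard token similarity, the greedy threshold clustering, the two threshold values, and the merge step (a plain union of the two clusterings) are all hypothetical stand-ins for the stages named in the abstract.

```python
from urllib.parse import urlparse

# Hypothetical domain list: the abstract does not say how social media
# results are identified.
SOCIAL_DOMAINS = {"facebook.com", "linkedin.com", "twitter.com"}


def is_social(url):
    """Return True if the URL's host is one of the assumed social media domains."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in SOCIAL_DOMAINS)


def jaccard(text_a, text_b):
    """Jaccard similarity between the token sets of two texts (stand-in measure)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(pages, threshold):
    """Greedy clustering: a page joins the first cluster that contains
    a sufficiently similar page, otherwise it starts a new cluster."""
    clusters = []
    for page in pages:
        for members in clusters:
            if any(jaccard(page["text"], m["text"]) >= threshold for m in members):
                members.append(page)
                break
        else:
            clusters.append([page])
    return clusters


def disambiguate(pages, social_threshold=0.5, other_threshold=0.2):
    """Split results by source type, cluster each part separately, then merge."""
    social = [p for p in pages if is_social(p["url"])]
    other = [p for p in pages if not is_social(p["url"])]
    # Assumed merge step: the union of both clusterings, with no attempt
    # to link a social media profile to a cluster of other pages.
    return cluster(social, social_threshold) + cluster(other, other_threshold)
```

Calling `disambiguate` on a list of dictionaries with `url` and `text` keys returns a list of clusters, each intended to hold the pages that refer to one person. Using separate thresholds for the two groups reflects only the general idea that the two document types may call for different treatment; the values themselves are arbitrary.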
