Fuzzy Ants Clustering for Web People Search

A search engine query for a person’s name often returns web pages about several different people who share that name. The Web People Search (WePS) problem is to organize the search results for such an ambiguous name query into meaningful clusters, each grouping together all web pages that refer to a single individual. A particularly challenging aspect of this task is that the number of clusters is generally not known beforehand. In this paper we therefore propose a Fuzzy Ants clustering algorithm that does not rely on prior knowledge of the number of clusters to be found in the data. An evaluation on benchmark data sets from SemEval’s WePS1 and WePS2 competitions shows that the resulting system is competitive with the agglomerative clustering algorithm Agnes. This is particularly interesting because Agnes requires a similarity threshold to be set manually (or the number of clusters to be estimated in advance), whereas the Fuzzy Ants approach does not.
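To illustrate why a manually set similarity threshold matters for agglomerative clustering, the following is a minimal sketch of average-linkage agglomerative clustering that stops merging once the best cluster-to-cluster similarity falls below a threshold. It is not the Agnes implementation evaluated in the paper; the function name `agglomerative`, the choice of average linkage, and the similarity values in `SIM` are assumptions made purely for illustration.

```python
from itertools import combinations


def agglomerative(sim, threshold):
    """Average-linkage agglomerative clustering over a similarity matrix.

    Merging stops as soon as the most similar pair of clusters has an
    average similarity below `threshold` (hypothetical stopping rule).
    """
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(a, b):
        # Mean pairwise similarity between the members of two clusters.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        (x, y), best = max(
            (((a, b), avg_sim(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        # x < y always holds, so deleting index y leaves index x valid.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Made-up similarities between four web pages (illustrative values only).
SIM = [
    [1.0, 0.9, 0.4, 0.1],
    [0.9, 1.0, 0.5, 0.2],
    [0.4, 0.5, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]

for t in (0.8, 0.4, 0.1):
    print(t, len(agglomerative(SIM, t)), "clusters")
```

On this toy input the same four pages end up in three, two, or one cluster depending only on the threshold, which is the dependence on a hand-tuned parameter that the abstract contrasts with the Fuzzy Ants approach.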
