Clustering web people search results using fuzzy ants

Person name queries often return web pages that refer to different individuals sharing the same name. The Web People Search (WePS) task consists of organizing the search results for an ambiguous person name query into meaningful clusters, with each cluster referring to one individual. This paper presents a fuzzy ant-based clustering approach to this multi-document person name disambiguation problem. The main advantage of fuzzy ant-based clustering, a technique inspired by the behavior of ants clustering dead nestmates into piles, is that the number of output clusters does not need to be specified. This makes the algorithm well suited to the WePS task, where we do not know in advance how many individuals each person name refers to. We compare our results with state-of-the-art partitional and hierarchical clustering approaches (k-means and Agnes) and demonstrate favorable results. This is particularly interesting because these baseline approaches require either estimating the number of clusters in advance or manually setting a similarity threshold, while the fuzzy ant-based clustering algorithm requires neither.
