A new measure of clustering effectiveness: Algorithms and experimental studies

We propose a new optimal clustering effectiveness measure, called CS1, that is based on a combination of clusters rather than on the single optimal cluster selected by the traditional MK1 measure. For hierarchical clustering, we present an algorithm to compute CS1, which is defined by seeking the optimal combination of disjoint clusters obtained by cutting the hierarchical structure at a certain similarity level. By reformulating the optimization as a 0-1 linear fractional programming problem, we demonstrate that an exact solution can be obtained by a linear-time algorithm. We further discuss how our approach can be generalized to problems involving overlapping clusters, and we show how optimal estimates can be obtained by greedy algorithms. © 2008 Wiley Periodicals, Inc.
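The following is a minimal sketch of the kind of 0-1 linear fractional optimization the abstract describes, not the paper's own algorithm. It assumes (the abstract does not say so) that effectiveness is scored by the F_beta measure of the union of the selected clusters, and that each disjoint cluster is summarized by a hypothetical pair `(relevant, size)` with `size >= 1`. Under those assumptions the objective is a ratio of two linear functions of the 0-1 selection variables, and an optimal selection is always a prefix of the clusters sorted by precision. The sketch sorts and scans prefixes, so it runs in O(n log n) rather than linear time; the function names are illustrative only.

```python
from fractions import Fraction
from itertools import combinations


def f_measure(selected, total_relevant, beta=1):
    """F_beta of the union of disjoint clusters given as (relevant, size) pairs."""
    rel = sum(r for r, _ in selected)
    size = sum(n for _, n in selected)
    if rel == 0:
        return Fraction(0)
    b2 = Fraction(beta) ** 2
    # Linear in the selection on top and bottom: a 0-1 linear fractional objective.
    return (1 + b2) * rel / (b2 * total_relevant + size)


def best_combination(clusters, total_relevant, beta=1):
    """Best-scoring set of disjoint clusters (assumed objective: F_beta).

    At the optimum, a cluster is included only if its precision exceeds a
    threshold, so it suffices to test prefixes of the precision-sorted list.
    """
    order = sorted(clusters, key=lambda c: Fraction(c[0], c[1]), reverse=True)
    best, best_score = [], Fraction(0)
    for k in range(1, len(order) + 1):
        score = f_measure(order[:k], total_relevant, beta)
        if score > best_score:
            best, best_score = order[:k], score
    return best, best_score


def brute_force(clusters, total_relevant, beta=1):
    """Exhaustive check over all subsets; only for validating small inputs."""
    return max(
        f_measure(subset, total_relevant, beta)
        for k in range(len(clusters) + 1)
        for subset in combinations(clusters, k)
    )


if __name__ == "__main__":
    # Hypothetical data: four disjoint clusters as (relevant, size), 8 relevant in total.
    clusters = [(4, 5), (3, 6), (1, 8), (0, 3)]
    best, score = best_combination(clusters, total_relevant=8)
    print(best, score)  # [(4, 5), (3, 6)] 14/19
    print(max(f_measure([c], 8) for c in clusters))  # best single cluster: 8/13
```

In this made-up example the best single cluster scores 8/13, while combining the two most precise clusters scores 14/19, which illustrates why a combination-based measure can exceed a single-cluster one.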
