Scatter/Gather: a cluster-based approach to browsing large document collections

Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval. We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm.
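The browsing loop described above can be sketched as follows. This is a minimal illustration only, not the paper's algorithms, which the abstract does not specify. The function names (`vectorize`, `cosine`, `scatter`, `gather`), the bag-of-words representation, and the choice of the first k documents as seeds are all assumptions made for this example. The single assignment pass costs k similarity computations per document, so it is linear in the number of documents for fixed k.

```python
import math
from collections import Counter


def vectorize(text):
    """Hypothetical helper: bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def scatter(docs, k):
    """Partition docs into at most k groups with one pass over the collection.

    Assumption: the first k documents serve as seeds. Each document is
    assigned to its most similar seed, so the work is k comparisons per
    document.
    """
    seeds = [vectorize(d) for d in docs[:k]]
    clusters = [[] for _ in seeds]
    for d in docs:
        v = vectorize(d)
        best = max(range(len(seeds)), key=lambda i: cosine(v, seeds[i]))
        clusters[best].append(d)
    return clusters


def gather(clusters, chosen):
    """Merge the clusters the user selected into a smaller sub-collection."""
    return [d for i in chosen for d in clusters[i]]


docs = [
    "cat dog pet",
    "stock market trade",
    "dog pet food",
    "market trade price",
]
clusters = scatter(docs, 2)      # scatter the whole collection
selected = gather(clusters, [0])  # the user picks the first group
refined = scatter(selected, 2)    # scatter again within the selection
```

Each round narrows the collection: the user inspects the groups, gathers the interesting ones, and the gathered documents are scattered again at a finer grain.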
