Large-scale Document Clustering for Associative Document Search

Approximate algorithms for clustering large-scale document collections are proposed and evaluated in the context of cluster-based document retrieval (i.e., associative document search). These algorithms use a precise clustering algorithm as a subroutine to construct a stratified structure of cluster trees. In an experiment, the best case showed a speedup of more than 100 times in CPU time. Through experiments on self retrieval and topic assignment, we confirmed that search performance is sufficient on cluster trees constructed by the approximate algorithms. In particular, top-down construction offered over 99% accuracy in self retrieval, which is comparable to exhaustive search. Top-down construction also offered promising performance in topic assignment, namely better recall/precision than that obtained by exhaustive search. All of the results for cluster-based retrieval were obtained by simple and efficient binary tree search.
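The binary tree search and the self-retrieval check mentioned in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: documents are sparse term-weight dictionaries, each cluster is summarized by a size-weighted mean centroid, similarity is cosine, and the names `Node`, `merge`, and `tree_search` are hypothetical. The tiny tree is assembled by hand rather than by any of the proposed construction algorithms.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional

Vector = Dict[str, float]  # sparse term -> weight (assumed representation)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (assumed measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Node:
    centroid: Vector
    doc_ids: List[str]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf(doc_id: str, vec: Vector) -> Node:
    return Node(dict(vec), [doc_id])


def merge(a: Node, b: Node) -> Node:
    """Join two subtrees; the parent centroid is the size-weighted mean."""
    na, nb = len(a.doc_ids), len(b.doc_ids)
    terms = set(a.centroid) | set(b.centroid)
    centroid = {
        t: (a.centroid.get(t, 0.0) * na + b.centroid.get(t, 0.0) * nb) / (na + nb)
        for t in terms
    }
    return Node(centroid, a.doc_ids + b.doc_ids, a, b)


def tree_search(root: Node, query: Vector) -> List[str]:
    """Greedy descent: at each internal node follow the more similar child."""
    node = root
    while node.left is not None and node.right is not None:
        go_left = cosine(query, node.left.centroid) >= cosine(query, node.right.centroid)
        node = node.left if go_left else node.right
    return node.doc_ids


# Toy collection and a hand-built binary cluster tree (illustrative only).
docs: Dict[str, Vector] = {
    "d1": {"cluster": 1.0, "tree": 1.0},
    "d2": {"cluster": 1.0, "search": 1.0},
    "d3": {"bayes": 1.0, "text": 1.0},
    "d4": {"bayes": 1.0, "category": 1.0},
}
tree = merge(
    merge(leaf("d1", docs["d1"]), leaf("d2", docs["d2"])),
    merge(leaf("d3", docs["d3"]), leaf("d4", docs["d4"])),
)

# Self retrieval: query with each document and check that the search
# returns the leaf holding that same document.
hits = sum(tree_search(tree, vec) == [doc_id] for doc_id, vec in docs.items())
self_retrieval_accuracy = hits / len(docs)
```

Under these assumptions the search costs one pair of similarity computations per tree level, whereas exhaustive search compares the query against every document. The price is that a wrong branch choice near the root cannot be undone, which is why self-retrieval accuracy is a natural check on tree quality.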
