A practical clustering algorithm for static and dynamic information organization

We present and analyze the off-line star algorithm for clustering static information systems and the on-line star algorithm for clustering dynamic information systems. These algorithms organize a document collection into clusters whose number is naturally induced by the collection, using a computationally efficient cover by dense subgraphs. We further show a lower bound on the accuracy of the clusters produced by these algorithms and demonstrate that the algorithms are efficient, with running times roughly linear in the size of the problem. Finally, we provide data from a number of
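
The abstract does not spell out the steps of the off-line algorithm, so the following is only a minimal sketch of a greedy star cover under stated assumptions: documents are compared through a precomputed pairwise similarity matrix, an edge joins two documents when their similarity reaches a threshold, and the highest-degree uncovered vertex repeatedly becomes a star center whose neighbors become its satellites. The names `star_cover`, `sim`, and `sigma` and the sample matrix are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Set


def star_cover(sim: List[List[float]], sigma: float) -> Dict[int, Set[int]]:
    """Greedy star cover of a thresholded similarity graph (illustrative sketch).

    sim   -- symmetric pairwise similarity matrix (assumed input format)
    sigma -- similarity threshold; an edge exists when sim[i][j] >= sigma
    Returns a mapping from each star center to its cluster (center + satellites).
    """
    n = len(sim)
    # Build adjacency sets of the thresholded graph, ignoring self-similarity.
    adj = [
        {j for j in range(n) if j != i and sim[i][j] >= sigma}
        for i in range(n)
    ]
    marked = [False] * n
    clusters: Dict[int, Set[int]] = {}
    # Degrees never change, so one descending sort is equivalent to repeatedly
    # picking the highest-degree unmarked vertex. Ties are broken by index.
    for v in sorted(range(n), key=lambda i: (-len(adj[i]), i)):
        if marked[v]:
            continue
        # v becomes a star center; all of its neighbors are satellites,
        # so a document may appear in more than one cluster.
        clusters[v] = {v} | adj[v]
        marked[v] = True
        for u in adj[v]:
            marked[u] = True
    return clusters


# Hypothetical similarities for five documents.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.2, 0.1, 0.1],
    [0.8, 0.2, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.1, 0.7, 1.0],
]
print(star_cover(sim, 0.5))
```

In this sketch the number of clusters is not supplied by the caller; it falls out of the threshold and the structure of the graph, which is one way to read "naturally induced by the collection". The quadratic graph construction is for clarity only and says nothing about the running times claimed in the abstract.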
