Journal of Graph Algorithms and Applications

The Star Clustering Algorithm for Static and Dynamic Information Organization

We present and analyze the off-line star algorithm for clustering static information systems and the on-line star algorithm for clustering dynamic information systems. These algorithms organize a document collection into a number of clusters that is naturally induced by the collection via a computationally efficient cover by dense subgraphs. We further show a lower bound on the quality of the clusters produced by these algorithms and demonstrate that the algorithms are efficient (running times roughly linear in the size of the problem). Finally, we provide data from a number of experiments.

Article type: regular paper. Communicated by: S. Khuller. Submitted: December 2003. Revised: August 2004.

Research supported in part by ONR contract N00014-95-1-1204, DARPA contract F30602-98-2-0107, and NSF grant CCF-0418390.

J. Aslam et al., The Star Clustering Algorithm, JGAA, 8(1) 95–129 (2004)
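As a rough illustration of what a cover by star-shaped dense subgraphs can look like, the following minimal sketch builds a thresholded similarity graph, then greedily takes the highest-degree uncovered vertex as a star center and assigns its neighbors to that center's cluster. The function name `star_cover`, the threshold parameter `sigma`, the index-based tie-breaking, and the dense-matrix input are assumptions made for illustration; the sketch is not claimed to reproduce the paper's exact procedure or its running time.

```python
from typing import Dict, List, Sequence, Set


def star_cover(sim: Sequence[Sequence[float]], sigma: float) -> Dict[int, Set[int]]:
    """Greedy star cover of the similarity graph thresholded at sigma (illustrative)."""
    n = len(sim)
    # Thresholded graph: documents i and j are adjacent when sim[i][j] >= sigma.
    adj: List[Set[int]] = [
        {j for j in range(n) if j != i and sim[i][j] >= sigma} for i in range(n)
    ]
    covered: Set[int] = set()
    clusters: Dict[int, Set[int]] = {}
    # Visit vertices by decreasing degree; ties broken by index (an assumption).
    for v in sorted(range(n), key=lambda i: (-len(adj[i]), i)):
        if v in covered:
            continue
        # v becomes a star center; its cluster is v plus all of its neighbors.
        clusters[v] = {v} | adj[v]
        covered |= clusters[v]
    return clusters
```

In this sketch a document may belong to more than one cluster, because a neighbor of a new center is added to that center's cluster even if an earlier star already covers it. The quadratic graph construction is chosen for brevity only.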
