Incremental clustering and dynamic information retrieval

Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. We propose a model called incremental clustering, which is based on a careful analysis of the requirements of the information retrieval application and which should also be useful in other applications. The goal is to efficiently maintain clusters of small diameter as new points are inserted. We analyze several natural greedy algorithms and demonstrate that they perform poorly. We propose new deterministic and randomized incremental clustering algorithms that have provably good performance and that we believe should also perform well in practice. We complement our positive results with lower bounds on the performance of incremental algorithms. Finally, we consider the dual clustering problem, where the clusters are of fixed diameter and the goal is to minimize the number of clusters.
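To make the setting concrete, the following is a minimal sketch of one natural greedy rule of the kind the abstract says can perform poorly. It is not one of the algorithms proposed in the paper. It assumes details the abstract does not spell out: a fixed bound `k` on the number of clusters, updates that only merge existing clusters, and Euclidean distance as the metric. The function names `diameter`, `insert_point`, and `cluster_stream` are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance; stands in for an arbitrary metric (assumption)


def diameter(cluster):
    """Largest pairwise distance inside a cluster (0.0 for a single point)."""
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def insert_point(clusters, point, k):
    """Add `point` as a singleton cluster. While more than k clusters exist,
    greedily merge the pair whose union has the smallest diameter."""
    clusters = clusters + [[point]]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: diameter(clusters[ij[0]] + clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


def cluster_stream(points, k):
    """Process points one at a time, keeping at most k clusters throughout."""
    clusters = []
    for p in points:
        clusters = insert_point(clusters, p, k)
    return clusters
```

For instance, `cluster_stream([(0, 0), (10, 0), (0.5, 0), (10.5, 0), (5, 0)], k=2)` keeps two clusters after every insertion. The late point `(5, 0)` must join one of the earlier groups because clusters are only ever merged, never split, which is the kind of irrevocable decision that makes the incremental setting harder than offline clustering.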
